Distributed web crawling

Distributed web crawling is a technique in which Internet search engines use many computers to perform the web crawling needed to index the Internet. The idea is to spread the required computing power and bandwidth across many computers and network connections.
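As a rough illustration of spreading crawl work across machines, the sketch below assigns each URL to one of several workers by hashing its host name. This is only one possible partitioning scheme and is not attributed to any particular search engine; the function names assign_worker and partition, and the fixed worker count, are assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to a worker index by hashing its host name.

    Hashing the host (rather than the full URL) keeps all pages of one
    site on the same worker, which is an assumption of this sketch.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Split a list of URLs into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[assign_worker(url, num_workers)].append(url)
    return buckets
```

Each bucket could then be handed to a separate machine, so that both the download bandwidth and the processing load are divided among the workers.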

As of 2003, most modern commercial search engines use this technique. Companies such as Google use thousands of individual computers in multiple locations to crawl the Web.

Newer projects attempt a less structured, more ad hoc form of collaboration, enlisting volunteers who in many cases contribute their home or personal computers. LookSmart is the largest search engine to use this approach, through its Grub distributed web-crawling project.

One proposed approach, though it is unclear whether Grub or any other project actually uses it, is to have every computer connected to the Internet crawl some Internet addresses (URLs) in the background. After downloading the pages, each client compresses the new pages and sends them back, together with a status flag (changed, new, down, redirected), to powerful central servers. The servers manage a large database and send out new URLs to be tested to all clients.
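The exchange described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration under stated assumptions, not a description of how Grub or any real system works. The names Server, Task, Report and crawl_one are hypothetical. The fetch function is supplied by the caller so the sketch runs without network access. The extra "unchanged" status, the use of a content hash to tell new pages from changed ones, and the treatment of any HTTP error code as "down" are additions made for the example.

```python
import hashlib
import zlib
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    NEW = "new"
    CHANGED = "changed"
    DOWN = "down"
    REDIRECTED = "redirected"
    UNCHANGED = "unchanged"  # assumption: not one of the flags named in the text


@dataclass
class Task:
    url: str
    known_hash: Optional[str]  # digest of the stored copy, None if never crawled


@dataclass
class Report:
    url: str
    status: Status
    payload: bytes = b""  # zlib-compressed page body, empty if nothing to send
    redirect_to: Optional[str] = None


class Server:
    """Central server: hands out URLs and stores what the clients send back."""

    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.hashes = {}  # url -> SHA-256 hex digest of the stored body
        self.pages = {}   # url -> page body (stands in for the large database)

    def next_task(self) -> Optional[Task]:
        if not self.queue:
            return None
        url = self.queue.popleft()
        return Task(url, self.hashes.get(url))

    def accept(self, report: Report) -> None:
        if report.status in (Status.NEW, Status.CHANGED):
            body = zlib.decompress(report.payload)
            self.pages[report.url] = body
            self.hashes[report.url] = hashlib.sha256(body).hexdigest()
        elif report.status is Status.REDIRECTED and report.redirect_to:
            # Schedule the redirect target to be tested by some client.
            self.queue.append(report.redirect_to)


def crawl_one(task: Task, fetch) -> Report:
    """Client side: download one URL and build a status report.

    `fetch(url)` is a caller-supplied function (hypothetical) returning
    (http_status, body_bytes, location_or_None) or raising OSError.
    """
    try:
        code, body, location = fetch(task.url)
    except OSError:
        return Report(task.url, Status.DOWN)
    if 300 <= code < 400 and location:
        return Report(task.url, Status.REDIRECTED, redirect_to=location)
    if code >= 400:
        return Report(task.url, Status.DOWN)  # simplification of this sketch
    digest = hashlib.sha256(body).hexdigest()
    if task.known_hash is None:
        status = Status.NEW
    elif digest != task.known_hash:
        status = Status.CHANGED
    else:
        return Report(task.url, Status.UNCHANGED)
    return Report(task.url, status, payload=zlib.compress(body))
```

A client would loop over server.next_task(), call crawl_one with its own download function, and pass each report to server.accept. In a real deployment the two sides would communicate over the network rather than through direct method calls.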

See also:

distributed computing

All Wikipedia text is available under the terms of the GNU Free Documentation License
