A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
A Web crawler may also be called a Web spider, an ant, or an automatic indexer.
Web search engines and some other sites use Web crawling or spidering software to update their own web content or to index other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search them much more quickly. Crawlers can also validate hyperlinks and HTML code.
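At its core, the crawl described above is a breadth-first traversal: fetch a page, extract its links, and enqueue any link not yet seen. Here is a minimal, network-free sketch where `fetch` is a caller-supplied function (a hypothetical stand-in for real HTTP requests):

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: visit each page, collect its links,
    and enqueue any unseen ones. `fetch` maps a URL to a list of
    outgoing links (injected here so the sketch stays offline).
    Returns the URLs visited, in crawl order."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for real HTTP fetches.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl(["a"], lambda u: graph.get(u, [])))  # → ['a', 'b', 'c']
```

The `seen` set is what keeps a real crawler from looping forever on cyclic links, and `max_pages` bounds the crawl.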
Uses:
Web crawlers are used for:
- indexing pages for search engines
- archiving the web (e.g. the Internet Archive, www.archive.org)
- analysing the web
What are the requirements for web crawlers?
- distributed
- scalable
- portable
- high performance
- (e.g. 50 million documents in 17 days with 4 Compaq servers)
- continuous
- extensible
- polite
Types of Crawlers:
- Yahoo! Slurp is the name of the Yahoo! Search crawler.
- Bingbot is the name of Microsoft's Bing web crawler.
- FAST Crawler is a distributed crawler, used by Fast Search & Transfer, and a general description of its architecture is available.
- Googlebot is described in some detail, but the reference covers only an early version of its architecture. In that design the crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction.
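That integrated design can be illustrated with a single parsing pass that collects both index terms and outgoing links. This is a sketch using Python's standard `html.parser`, not Googlebot's actual code:

```python
from html.parser import HTMLParser

class IndexingParser(HTMLParser):
    """One pass over HTML, collecting words (for full-text indexing)
    and hrefs (for the crawl frontier) at the same time."""

    def __init__(self):
        super().__init__()
        self.words = []  # terms for the index
        self.links = []  # URLs to crawl next

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())

p = IndexingParser()
p.feed('<p>Crawlers index <a href="/docs">pages</a></p>')
print(p.words)  # → ['Crawlers', 'index', 'pages']
print(p.links)  # → ['/docs']
```

Doing both extractions in one parse avoids reading each downloaded page twice, which matters at crawl scale.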