What is a web crawler?

A web crawler is a program that methodically browses the World Wide Web to collect information. Web crawlers (also called web spiders or bots) are most often used by search engines to keep their indexes of web content up to date. They can also be used for web scraping, the process of extracting information from websites.

A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies the hyperlinks on each page and adds them to the list of URLs still to visit, called the crawl frontier. Perhaps the best-known crawler on the web is Googlebot, which Google uses to collect documents and build the searchable index behind Google Search.
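
The seed-and-frontier loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the crawler built later in the course: the PAGES dictionary is a made-up, in-memory stand-in for a website, and the names LinkParser and crawl are hypothetical. A real crawler would download each page over HTTP instead of looking it up in a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" (URL -> HTML), used so the sketch
# runs without any network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/c": "",
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds):
    """Visit every page reachable from the seeds, each one only once."""
    frontier = deque(seeds)  # URLs still to visit (the crawl frontier)
    seen = set(seeds)        # URLs already queued, so none is queued twice
    visited = []             # pages in the order they were visited

    while frontier:
        url = frontier.popleft()
        visited.append(url)

        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))  # assumed lookup instead of a download

        for href in parser.links:
            link = urljoin(url, href)    # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return visited
```

Calling crawl(["http://example.com/"]) visits the four sample pages once each, even though some of them link back to pages already seen. The seen set is what keeps the crawler from looping forever on such links.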

In this course we will write a simple web crawler in Python that goes through all the pages that make up a website and gathers all the links. The program will be multithreaded, so it can download several pages at once instead of waiting for each request to finish before starting the next. After we’re done, you should have a functional web crawler that you can use to gather all the links (or other elements) from a domain.
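
To give a rough idea of why threads help, here is a minimal sketch that uses ThreadPoolExecutor from Python's standard library. It is not the course's implementation. The fetch function is a hypothetical stand-in that sleeps briefly instead of making a real request, and the default of 4 workers is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a download: waits, then returns fake HTML."""
    time.sleep(0.1)  # simulates time spent waiting on the network
    return f"<html>{url}</html>"


def fetch_all(urls, workers=4):
    """Fetch several URLs concurrently and return a dict of URL -> HTML."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs fetch in worker threads and yields the results
        # in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Because each thread spends most of its time waiting, several simulated downloads overlap instead of running one after another. The same idea applies when fetch performs a real HTTP request.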

Geek University 2022