Crawler description

In this course we will write a simple web crawler in Python that will go through all the pages that make up a website and gather all the links. Here are the exact steps the crawler will follow:

  1. Create a directory for each website, if not already created.
  2. Create the queue and crawled text files for each project.
  3. Create sets that will be used to speed up the crawling process.
  4. Write a multi-threaded crawler that will crawl the web page and extract the links from the page HTML.
  5. Add a URL that needs to be crawled to the waiting list.
  6. When a URL is crawled, add that URL to the crawled list and get the next URL from the waiting list.
  7. Repeat the process as long as there are URLs in the queue.txt file.
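
The first few steps above can be sketched with standard-library helpers. This is a minimal illustration, not the course's exact code: the file names `queue.txt` and `crawled.txt` follow the steps above, while the function names and the `LinkFinder` class are assumptions made for this sketch.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin


def create_project_dir(directory):
    # Step 1: create a directory for the website, if not already created
    if not os.path.exists(directory):
        os.makedirs(directory)


def create_data_files(project_dir, base_url):
    # Step 2: create queue.txt and crawled.txt, seeding the queue
    # with the site's base URL (file names assumed from the steps above)
    queue = os.path.join(project_dir, 'queue.txt')
    crawled = os.path.join(project_dir, 'crawled.txt')
    if not os.path.isfile(queue):
        with open(queue, 'w') as f:
            f.write(base_url)
    if not os.path.isfile(crawled):
        with open(crawled, 'w') as f:
            f.write('')


def file_to_set(file_name):
    # Step 3: load a text file into a set, so membership checks
    # ("was this URL already crawled?") are fast
    results = set()
    with open(file_name) as f:
        for line in f:
            results.add(line.strip())
    results.discard('')
    return results


class LinkFinder(HTMLParser):
    # Step 4: extract links from page HTML (class name is an assumption);
    # relative hrefs are resolved against the page URL with urljoin
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href' and value:
                    self.links.add(urljoin(self.base_url, value))
```

Steps 5 to 7 then become a loop: pop a URL from the queue set, fetch and parse it with a `LinkFinder`, move the URL to the crawled set, add any newly found links to the queue, and repeat until the queue set is empty.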
Geek University 2022