In this course we will write a simple web crawler in Python that visits every page of a website and gathers all the links it finds. Here are the exact steps the crawler will go through:
- Create a directory for each website, if not already created.
- Create the queue and crawled text files for each project.
- Create in-memory sets mirroring those files, so that membership checks during crawling are fast.
- Write a multi-threaded crawler that will crawl the web page and extract the links from the page HTML.
- Add a URL that needs to be crawled to the waiting list.
- When a URL has been crawled, add it to the crawled list and take the next URL from the waiting list.
- Repeat the process as long as there are URLs in the queue.txt file.
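The steps above can be sketched roughly as follows. This is a minimal outline, not the course's actual code: the function names (`create_project`, `file_to_set`, `crawl_worker`) and the same-domain filtering are my own choices, and the sketch uses only the standard library (`html.parser` for link extraction, `threading` and `queue` for the worker threads).

```python
import os
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkFinder(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def create_project(project_dir, homepage):
    """Create the project directory plus queue.txt / crawled.txt (steps 1-2)."""
    os.makedirs(project_dir, exist_ok=True)
    queue_path = os.path.join(project_dir, "queue.txt")
    crawled_path = os.path.join(project_dir, "crawled.txt")
    if not os.path.isfile(queue_path):
        with open(queue_path, "w") as f:
            f.write(homepage + "\n")  # seed the queue with the homepage
    if not os.path.isfile(crawled_path):
        open(crawled_path, "w").close()
    return queue_path, crawled_path


def file_to_set(path):
    """Load a text file into a set (step 3: fast membership checks)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def set_to_file(items, path):
    """Write a set back out, one URL per line."""
    with open(path, "w") as f:
        for item in sorted(items):
            f.write(item + "\n")


def crawl_worker(work_queue, queue_set, crawled_set, lock):
    """Worker thread (steps 4-7): fetch a URL, extract links, queue new ones."""
    while True:
        url = work_queue.get()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            finder = LinkFinder(url)
            finder.feed(html)
            with lock:
                queue_set.discard(url)
                crawled_set.add(url)
                for link in finder.links:
                    # Stay on the same site; skip anything already seen.
                    if (urlparse(link).netloc == urlparse(url).netloc
                            and link not in crawled_set
                            and link not in queue_set):
                        queue_set.add(link)
                        work_queue.put(link)
        except Exception:
            pass  # a sketch: real code would log the failed URL
        finally:
            work_queue.task_done()
```

A driver would call `create_project`, load both files into sets with `file_to_set`, push the queued URLs onto a `queue.Queue`, start a few `crawl_worker` daemon threads sharing one `threading.Lock`, and periodically flush the sets back to disk with `set_to_file` until the queue is empty.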