Create the crawler
In this chapter we will create a new file containing the code for the actual web crawler. The crawler takes the URLs that need to be crawled from the waiting list (the queue), connects to each page, downloads its HTML and passes it to the link_finder.py program, which extracts all the links. Once the spider is done with a page, it moves that page's link to the crawled file.
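The workflow described above can be sketched in a few lines before we write the real class. This is only an illustration under stated assumptions: SimpleLinkParser is a hypothetical stand-in for the LinkFinder class in link_finder.py, crawl_one is a made-up helper name, and the page is served from a dictionary instead of the network so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkParser(HTMLParser):
    """Hypothetical stand-in for LinkFinder: collects absolute link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # turn relative links such as /about into full URLs
                    self.links.add(urljoin(self.base_url, value))


def crawl_one(page_url, fetch_html, queue, crawled):
    """Process one page: extract its links, then move it from queue to crawled."""
    if page_url in crawled:
        return
    parser = SimpleLinkParser(page_url)
    parser.feed(fetch_html(page_url))
    for link in parser.links:
        if link not in crawled:
            queue.add(link)
    queue.discard(page_url)  # the page is no longer waiting
    crawled.add(page_url)    # remember that it has been visited


# Assumed test data: one fake page, so no network connection is needed
fake_pages = {'https://example.com/': '<a href="/about">About</a><a href="/">Home</a>'}
queue = {'https://example.com/'}
crawled = set()
crawl_one('https://example.com/', fake_pages.get, queue, crawled)
```

After the call, the homepage has moved from queue to crawled and the newly discovered /about link is waiting in queue. In the real spider, the fetch_html argument would be replaced by a function that downloads the page with urlopen.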
We will use class variables because they are shared by every instance of the class, so all our spiders work with the same queue and the same set of crawled links. Here is the code we need to place in the spider.py file:
from urllib.request import urlopen  # a module that enables us to connect to webpages
from link_finder import LinkFinder
from domain import *
from general import *


class Spider:

    project_name = ''  # the name of the project
    base_url = ''      # usually the homepage URL
    domain_name = ''   # this variable will help us ensure that we are connecting to a valid domain name
    queue_file = ''    # the location of the queue file
    crawled_file = ''  # the location of the crawled file
    queue = set()      # creates a set for the links in queue
    crawled = set()    # creates a set for the crawled links
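To see why class variables suit this design, here is a small self-contained sketch. DemoSpider is a hypothetical class used only for this illustration (it is not the Spider class above), and the URLs are placeholders.

```python
class DemoSpider:
    # class variables: a single copy shared by every DemoSpider instance
    queue = set()
    crawled = set()


first = DemoSpider()
second = DemoSpider()

# a link added through the class is visible from every instance
DemoSpider.queue.add('https://example.com/')

# changing the shared set through one instance affects the others too
first.crawled.add('https://example.com/about')

print(second.queue)
print(second.crawled is DemoSpider.crawled)
```

Note that this only holds as long as we modify the shared sets in place. Assigning a new value through an instance, for example first.queue = set(), would create a separate instance attribute that hides the class variable for that one spider.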