Create the crawler

In this chapter we will create a new file that contains the code for the actual web crawler. The crawler takes the URLs that need to be crawled from the waiting list (the queue), connects to each page, downloads its HTML, and passes it to the link_finder.py program, which extracts all the links. Once the spider is done with a page, it moves that page's link from the queue to the crawled file.
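
Before we write the real spider, the workflow above can be sketched with two plain sets. This is only an illustration under stated assumptions: the names crawl_one, find_links and FAKE_SITE are hypothetical and not part of the course code, and a small in-memory dictionary stands in for real web pages so that the sketch runs without a network connection.

```python
def crawl_one(page_url, queue, crawled, find_links):
    """Process one page: collect its links, then mark the page as crawled."""
    for link in find_links(page_url):
        # Only enqueue links that are neither waiting nor already crawled.
        if link not in queue and link not in crawled:
            queue.add(link)
    # The page is done: move it from the queue to the crawled set.
    queue.discard(page_url)
    crawled.add(page_url)


# Hypothetical stand-in for a website: each URL maps to the links on that page.
FAKE_SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/'],
    'https://example.com/b': [],
}

queue = {'https://example.com/'}  # start with the homepage only
crawled = set()

while queue:
    next_url = next(iter(queue))  # pick any waiting URL
    crawl_one(next_url, queue, crawled, lambda url: FAKE_SITE.get(url, []))

print(sorted(crawled))
```

The loop ends when the queue is empty, at which point every reachable page has been moved to the crawled set. The real spider follows the same pattern but fetches pages over the network and also saves both sets to files.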

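We will use class variables because they belong to the class itself rather than to a single instance, so every spider we create shares the same queue and the same set of crawled links. Here is the code we need to place in the spider.py file: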

from urllib.request import urlopen  # a function that opens a URL so we can connect to webpages
from link_finder import LinkFinder
from domain import *
from general import *


class Spider:
    project_name = ''  # the name of the project
    base_url = ''  # usually the homepage URL
    domain_name = ''  # this variable will help us ensure that we are connecting to a valid domain name
    queue_file = ''  # the location of the queue file
    crawled_file = ''  # the location of the crawled file
    queue = set()  # creates a set for the links in queue
    crawled = set()  # creates a set for the crawled links
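
To see why class variables suit this design, here is a minimal sketch that uses a cut-down copy of the class with only three of its variables. The two instances and the example.com URLs are assumptions made for illustration; they are not part of the course code.

```python
class Spider:
    base_url = ''    # shared by every instance until rebound on an instance
    queue = set()    # one set object, shared by all spiders
    crawled = set()


first = Spider()
second = Spider()

# Adding to the set through one instance is visible through the other,
# because both names refer to the same set stored on the class.
first.queue.add('https://example.com/')
print(second.queue)                # {'https://example.com/'}
print(Spider.queue is second.queue)  # True

# Caution: assigning through an instance creates a new instance attribute
# and leaves the class variable untouched.
first.base_url = 'https://example.com/'
print(repr(Spider.base_url))       # ''

# To change the shared value, assign on the class itself.
Spider.base_url = 'https://example.com/'
print(second.base_url)             # https://example.com/
```

This is why shared settings such as base_url are best assigned on the class (Spider.base_url = ...) rather than on an individual spider.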
