Creating the spider
In this section we will write code that takes two parameters from the user: the name of the project and the homepage. These are the only values the user is required to supply. Here's the code to enter in a new file called main.py:
# since we will run multiple spiders simultaneously, we need to import the threading module
import threading
from queue import Queue
from spider import Spider
from domain import *
from general import *

PROJECT_NAME = ''
HOMEPAGE = ''
DOMAIN_NAME = get_domain_name(HOMEPAGE)  # the function will get the domain name from the HOMEPAGE variable
QUEUE_FILE = PROJECT_NAME + '/queue.txt'  # the location of the queue file
CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'  # the location of the crawled file
NUMBER_OF_THREADS = 2  # this number depends on your operating system
queue = Queue()  # represents the thread queue
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)  # the first spider will create the project directory and the data files
Python has no true constants, but by convention a variable that is not meant to change is written in all caps. NUMBER_OF_THREADS sets how many spiders run simultaneously, and a suitable value depends on your system.
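To make the role of NUMBER_OF_THREADS and the queue more concrete, here is a minimal, self-contained sketch of a fixed number of worker threads pulling jobs from a Queue. It is only an illustration under stated assumptions: run_workers and fake_crawl are hypothetical names invented for this example, and fake_crawl is a stand-in for the real crawling step, not the Spider class used in this project.

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 2  # all caps by convention: a value we do not intend to change


def run_workers(urls, handle, number_of_threads=NUMBER_OF_THREADS):
    """Process every item in `urls` with a fixed number of worker threads.

    `handle` is any function that takes one URL and returns a result.
    (Hypothetical helper for illustration, not part of the project code.)
    """
    jobs = Queue()           # thread-safe queue of pending URLs
    results = []
    lock = threading.Lock()  # protects `results` while several threads append

    def work():
        while True:
            url = jobs.get()  # blocks until a job is available
            try:
                outcome = handle(url)
                with lock:
                    results.append(outcome)
            finally:
                jobs.task_done()  # tell the queue this job is finished

    # Daemon threads stop automatically when the main program exits.
    for _ in range(number_of_threads):
        threading.Thread(target=work, daemon=True).start()

    for url in urls:
        jobs.put(url)

    jobs.join()  # wait until every queued job has been marked done
    return results


def fake_crawl(url):
    # Stand-in for real crawling: report which thread handled the URL.
    return threading.current_thread().name + ' crawled ' + url


if __name__ == '__main__':
    pages = ['https://example.com/', 'https://example.com/about']
    for line in run_workers(pages, fake_crawl):
        print(line)
```

Raising NUMBER_OF_THREADS lets more URLs be handled at the same time, but the order in which results arrive is not guaranteed, which is why the sketch collects them under a lock instead of relying on sequence.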