Creating the spider

In this section we will write the code that takes two parameters from the user: the name of the project and the homepage. These are the only values the user has to supply, by filling in the PROJECT_NAME and HOMEPAGE variables. Enter the following code in a new file called main.py:

import threading  # needed because we will run multiple spiders simultaneously
from queue import Queue
from spider import Spider
from domain import *
from general import *


PROJECT_NAME = ''  # set this to the name of your project
HOMEPAGE = ''  # set this to the homepage of the site to crawl
DOMAIN_NAME = get_domain_name(HOMEPAGE)  # extracts the domain name from the HOMEPAGE variable
QUEUE_FILE = PROJECT_NAME + '/queue.txt'  # the location of the queue file
CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'  # the location of the crawled file
NUMBER_OF_THREADS = 2  # how many spiders run simultaneously; the right value depends on your system
queue = Queue()  # the thread queue

Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME) # the first spider will create the project directory and the data files
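
As a rough illustration of how these variables fit together, the sketch below fills in hypothetical values (example_project and https://www.example.com/ are placeholders, not part of the tutorial). Because domain.py is not shown here, get_domain_name is replaced by a simplified stand-in that keeps the last two parts of the host name; the real helper may behave differently.

```python
from urllib.parse import urlparse


def get_domain_name(url):
    """Hypothetical stand-in for the helper in domain.py.

    Keeps only the last two labels of the host name, so it is an
    assumption that will not handle suffixes such as .co.uk correctly.
    """
    host = urlparse(url).netloc
    return '.'.join(host.split('.')[-2:])


PROJECT_NAME = 'example_project'        # placeholder project name
HOMEPAGE = 'https://www.example.com/'   # placeholder homepage
DOMAIN_NAME = get_domain_name(HOMEPAGE)
QUEUE_FILE = PROJECT_NAME + '/queue.txt'
CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'

print(DOMAIN_NAME)   # example.com
print(QUEUE_FILE)    # example_project/queue.txt
print(CRAWLED_FILE)  # example_project/crawled.txt
```

Both data files end up inside a directory named after the project, which is why only the project name and the homepage have to be supplied.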

 

Python has no true constants, but by convention a variable whose value will not change is written in all caps. The number of threads is the number of spiders that will run simultaneously; the right value depends on your system.
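
To make the role of NUMBER_OF_THREADS and the queue more concrete, here is a minimal sketch of one common way to share work between a fixed number of threads. It is not the tutorial's own code: crawl_all and work are hypothetical names, and appending the URL to a list stands in for the real crawling step.

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 2  # all caps by convention: treated as a constant


def crawl_all(urls, number_of_threads=NUMBER_OF_THREADS):
    """Process every URL using a fixed number of worker threads."""
    queue = Queue()          # thread-safe queue shared by all workers
    crawled = []             # stand-in for the crawled file
    lock = threading.Lock()  # protects the shared list

    def work():
        # Each worker keeps taking URLs until it receives the None sentinel.
        while True:
            url = queue.get()
            try:
                if url is None:
                    return
                with lock:
                    crawled.append(url)  # placeholder for the real crawl
            finally:
                queue.task_done()

    threads = [threading.Thread(target=work, daemon=True)
               for _ in range(number_of_threads)]
    for thread in threads:
        thread.start()
    for url in urls:
        queue.put(url)
    for _ in threads:
        queue.put(None)      # one sentinel per worker so each one stops
    queue.join()             # wait until every queued item is processed
    for thread in threads:
        thread.join()
    return sorted(crawled)


print(crawl_all(['https://www.example.com/b', 'https://www.example.com/a']))
```

Raising NUMBER_OF_THREADS lets more URLs be handled at the same time, at the cost of more load on both your machine and the site being crawled.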
Geek University 2022