Create queue and crawled files
We will create two files inside the project folder for each website we crawl:
- queue file – this file will contain all URLs that the crawler finds on the website we are trying to crawl. It will serve as a sort of waiting list.
- crawled file – once a URL has been crawled, it will be moved from the queue file to this file. This ensures that we do not crawl the same web page multiple times (a short illustrative sketch of this workflow follows the list).
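To make the queue-to-crawled workflow above concrete, here is a minimal, hypothetical sketch of moving a single URL from queue.txt to crawled.txt. The function name move_url_to_crawled() is an assumption made for illustration only; the actual crawler logic is implemented later in this course:

import os

def move_url_to_crawled(project_name, url):
    # Hypothetical illustration only: take a visited URL out of queue.txt
    # and append it to crawled.txt.
    queue_path = os.path.join(project_name, 'queue.txt')
    crawled_path = os.path.join(project_name, 'crawled.txt')

    with open(queue_path, 'r') as f:
        queued = [line.strip() for line in f if line.strip()]

    if url in queued:
        queued.remove(url)                    # remove the URL from the waiting list
        with open(queue_path, 'w') as f:
            for remaining in queued:
                f.write(remaining + '\n')     # rewrite the queue without the visited URL
        with open(crawled_path, 'a') as f:
            f.write(url + '\n')               # record the URL as already crawled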
Our program will accept only the homepage URL as a parameter. The two files listed above will be created and stored inside the project’s directory. To create the files, we will need to provide two arguments:
- project_name – the name of the project folder.
- base_url – the homepage URL of the website we are trying to crawl.
Here is the code you need to write in the general.py file (it uses the os module, so make sure import os is at the top of the file):
def create_data_files(project_name, base_url):
    queue = os.path.join(project_name, 'queue.txt')        # sets the path for the queue file
    crawled = os.path.join(project_name, 'crawled.txt')    # sets the path for the crawled file
    if not os.path.isfile(queue):                          # checks whether the file already exists
        write_file(queue, base_url)                        # creates the file and inserts the homepage URL
    if not os.path.isfile(crawled):
        write_file(crawled, '')                            # creates an empty crawled file
Here is the write_file() function that creates the new files (it is called twice in create_data_files() above):
def write_file(path, data):
    f = open(path, 'w')    # creates a file for writing
    f.write(data)          # writes data to the file
    f.close()              # closes the file
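As a side note, the same helper can also be written with a context manager, which closes the file automatically even if the write fails. This is an equivalent alternative, not something the steps above require:

def write_file(path, data):
    with open(path, 'w') as f:    # the file is closed automatically when the block ends
        f.write(data)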
You can verify that the code above works by calling the create_project_dir() and create_data_files() functions. For example, the code below should create a folder called GeekUniversity with two files inside, queue.txt and crawled.txt, and the queue.txt file should contain the URL https://geek-university.com:
create_project_dir('GeekUniversity')
create_data_files('GeekUniversity', 'https://geek-university.com')
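If you would rather check the result from Python than open the files manually, reading queue.txt should print the homepage URL (assuming the two calls above completed without errors):

with open('GeekUniversity/queue.txt') as f:
    print(f.read())    # expected output: https://geek-university.com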