Create queue and crawled files

We will create two files inside the project folder for each website we crawl:

  • queue file – this file will contain all URLs found on the pages we crawl but not yet visited. It serves as a waiting list.
  • crawled file – once a URL has been crawled, it is moved from the queue file to this file. This ensures that we do not crawl the same web page multiple times (see the sketch after this list).
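
To make the hand-off between the two files concrete, here is a minimal sketch of how a URL could be moved from the queue file to the crawled file after it has been visited. The move_url name and the default file paths are illustrative only; the crawler we build in this tutorial implements this step with its own functions:

def move_url(url, queue_path='queue.txt', crawled_path='crawled.txt'):
    with open(queue_path, 'r') as f:
        urls = f.read().splitlines()  # every URL still waiting to be crawled
    if url in urls:
        urls.remove(url)  # take the URL off the waiting list
    with open(queue_path, 'w') as f:
        f.write('\n'.join(urls))  # rewrite the queue without the visited URL
    with open(crawled_path, 'a') as f:
        f.write(url + '\n')  # append the URL to the crawled file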

Our program will accept only the homepage URL as a parameter. The two files listed above will be created and stored inside the project’s directory. To create the files, we need to provide two arguments:

  • project_name – the name of the project folder.
  • base_url – the homepage URL of the website we are trying to crawl.

Here is the code you need to write in the general.py file:

import os

def create_data_files(project_name, base_url):
    queue = os.path.join(project_name, 'queue.txt')  # sets the path for the queue file
    crawled = os.path.join(project_name, 'crawled.txt')  # sets the path for the crawled file
    if not os.path.isfile(queue):  # checks whether the file already exists
        write_file(queue, base_url)  # creates the file and writes the homepage URL into it
    if not os.path.isfile(crawled):
        write_file(crawled, '')  # creates an empty crawled file

Here is the function that will create new files:

def write_file(path, data):
    with open(path, 'w') as f:  # opens (or creates) the file for writing
        f.write(data)  # writes the data; the with block closes the file automatically
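
One design choice is worth noting here: write_file() opens the file in 'w' mode, which overwrites an existing file. That is why create_data_files() guards each call with os.path.isfile(): without the check, re-running the program would wipe out the queue and crawled lists from a previous crawl.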

You can verify that the code above works by calling the create_project_dir() and create_data_files() functions. For example, the code below should create a folder called GeekUniversity with two files inside, queue.txt and crawled.txt. The queue.txt file should contain the URL https://geek-university.com:

create_project_dir('GeekUniversity')

create_data_files('GeekUniversity', 'https://geek-university.com')
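
To confirm the result without opening the folder by hand, you can read both files back. This check is only a sketch and assumes the calls above were run from the current working directory:

import os

with open(os.path.join('GeekUniversity', 'queue.txt')) as f:
    print(f.read())  # should print: https://geek-university.com

with open(os.path.join('GeekUniversity', 'crawled.txt')) as f:
    print(repr(f.read()))  # should print: '' (the crawled file starts out empty)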