Add links to queue

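After we gather the links from a web page, we need to add them to the queue so that they get crawled as well. In this section we will write a function that adds new links to the queue, skipping links that are already queued or crawled and links that point to other websites. We will also write a small function that saves the queue and the crawled list to their files.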

The following code goes in the spider.py file, inside the Spider class:

    @staticmethod
    def add_links_to_queue(links):  # takes a set of links and adds the new ones to the waiting list
        for url in links:  # loops through the set
            if (url in Spider.queue) or (url in Spider.crawled):  # skips links that are already in the waiting or the crawled list
                continue
            # Skips links whose domain name differs from the targeted domain name.
            # This ensures that the crawler will crawl only pages on the targeted website,
            # and not the external links present on the website.
            if Spider.domain_name != get_domain_name(url):
                continue
            Spider.queue.add(url)  # adds the link to the waiting list

    @staticmethod
    def update_files():  # saves the waiting list and the crawled list to their files
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)
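
To see the filtering in action, here is a minimal, self-contained sketch that can be run on its own. It is not the full spider: the target domain example.com and the sample URLs are made up for illustration, and get_domain_name below is a simplified stand-in for the tutorial's own helper (it assumes the domain is simply the last two parts of the host name).

```python
from urllib.parse import urlparse


def get_domain_name(url):
    # Simplified stand-in (assumption): keep the last two parts of the host name,
    # so "blog.example.com" becomes "example.com".
    host = urlparse(url).netloc
    return '.'.join(host.split('.')[-2:])


class Spider:
    # Class-level variables, referenced as Spider.queue and Spider.crawled
    domain_name = 'example.com'  # hypothetical target domain
    queue = set()                # the waiting list
    crawled = set()              # pages that were already crawled

    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            if (url in Spider.queue) or (url in Spider.crawled):
                continue  # already in the waiting or the crawled list
            if Spider.domain_name != get_domain_name(url):
                continue  # external link
            Spider.queue.add(url)


# Pretend the home page was crawled and the about page is already waiting
Spider.crawled.add('https://example.com/')
Spider.queue.add('https://example.com/about')

Spider.add_links_to_queue({
    'https://example.com/',           # already crawled, skipped
    'https://example.com/about',      # already queued, skipped
    'https://example.com/contact',    # new link on the target site, added
    'https://blog.example.com/post',  # subdomain of the target site, added
    'https://other.org/page',         # external website, skipped
})

print(sorted(Spider.queue))
```

Because the queue and the crawled list are sets, the membership checks are fast, and adding a link twice could never create a duplicate anyway. The explicit checks matter mostly for the crawled list, since they keep a finished page from going back into the queue.
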
Geek University 2022