Crawl a page

In this chapter we will write a method that crawls a web page. We pass in the URL of the page to crawl; the crawler finds all the links on that page and adds them to the queue of waiting links. Once the links have been collected, the page URL is moved from the waiting list to the crawled file. This ensures that the same page is not crawled twice.

Add the following code to the spider.py file, inside the Spider class:

    @staticmethod
    def crawl_page(thread_name, page_url):  # crawls a single page
        if page_url not in Spider.crawled:  # skip pages that have already been crawled
            print(thread_name + ' now crawling ' + page_url)  # show which page is being crawled
            print('Queue ' + str(len(Spider.queue)) + ' | Crawled ' + str(len(Spider.crawled)))  # show how many links are waiting and how many have been crawled
            Spider.add_links_to_queue(Spider.gather_links(page_url))  # collect the page's links and add them to the waiting list
            Spider.queue.discard(page_url)  # remove the crawled page from the queue set (discard avoids an error if the URL was never queued)
            Spider.crawled.add(page_url)  # add the crawled page to the crawled set
            Spider.update_files()  # write both sets back to their files
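
The crawl_page() method relies on class attributes and helper methods (queue, crawled, gather_links(), add_links_to_queue(), update_files()) that are defined elsewhere in the Spider class. As a rough, minimal sketch of the context the method assumes — the names come from the code above, but the set-based storage and the placeholder bodies below are illustrative assumptions, not the actual implementations from the other chapters:

    class Spider:
        queue = set()    # URLs waiting to be crawled
        crawled = set()  # URLs that have already been crawled

        @staticmethod
        def gather_links(page_url):
            # Placeholder: the real method fetches page_url and returns
            # the set of links found on that page
            return set()

        @staticmethod
        def add_links_to_queue(links):
            # Placeholder: add each new link to the queue, skipping links
            # that are already queued or already crawled
            for url in links:
                if url not in Spider.queue and url not in Spider.crawled:
                    Spider.queue.add(url)

        @staticmethod
        def update_files():
            # Placeholder: the real method writes both sets to their files
            pass

With the crawl_page() method from this chapter added to that skeleton, a call such as Spider.crawl_page('Thread-1', 'http://example.com') prints the two status lines and moves the URL from the queue set into the crawled set.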