Crawl a page
In this chapter we will write a method that crawls a web page. We will pass in the URL of the page to crawl, and the crawler will find all the links on that page and add them to the waiting list. Once the links have been collected, the page's URL is moved from the waiting list to the crawled list. This ensures that the same page is not crawled twice.
Add the following code to the spider.py file, under the Spider class:
@staticmethod
def crawl_page(thread_name, page_url):
    # the method that starts the crawling
    if page_url not in Spider.crawled:  # ensures that the page wasn't already crawled
        print(thread_name + ' now crawling ' + page_url)  # displays the page that is being crawled
        # prints how many links are in the waiting list and how many have already been crawled
        print('Queue ' + str(len(Spider.queue)) + ' | Crawled ' + str(len(Spider.crawled)))
        Spider.add_links_to_queue(Spider.gather_links(page_url))  # adds the found links to the waiting list
        Spider.queue.remove(page_url)  # removes the page that has been crawled from the queue set
        Spider.crawled.add(page_url)   # adds the page that has been crawled to the crawled set
        Spider.update_files()          # converts the sets to files
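To see the bookkeeping in isolation, here is a minimal sketch of the queue/crawled logic using plain in-memory sets. The `crawl_page` function and the example URLs below are hypothetical stand-ins: they mimic the set operations the Spider class performs, without the file handling or the real link gathering.

```python
# Hypothetical in-memory stand-ins for Spider.queue and Spider.crawled.
queue = {'http://example.com/a', 'http://example.com/b'}
crawled = set()

def crawl_page(page_url, found_links):
    # skip pages that have already been crawled
    if page_url in crawled:
        return
    # add newly found links, unless they are already queued or crawled
    for link in found_links:
        if link not in queue and link not in crawled:
            queue.add(link)
    # move the page from the waiting set to the crawled set
    queue.remove(page_url)
    crawled.add(page_url)

crawl_page('http://example.com/a', {'http://example.com/c'})
print(sorted(queue))    # ['http://example.com/b', 'http://example.com/c']
print(sorted(crawled))  # ['http://example.com/a']
```

Because membership is checked against both sets before adding a link, calling `crawl_page` again with the same URL is a no-op, which is exactly why the same page is never crawled twice.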