Gather links
In this section we will create a method that fetches the HTML of a page and hands it to the class that extracts the links. Because urlopen returns the page data as raw bytes rather than human-readable text, we also need to decode those bytes into a string before passing it to LinkFinder.
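For example, fetching any page directly with urlopen shows the difference between the raw bytes and the decoded string. The short sketch below is only an illustration (the python.org address is a stand-in for any URL) and is not part of spider.py:

from urllib.request import urlopen

response = urlopen('https://www.python.org/')  # example URL, used here only for illustration
html_bytes = response.read()  # the body of the response, as raw bytes (e.g. b'<!doctype html>...')
html_string = html_bytes.decode("utf-8")  # decoding turns the bytes into a human-readable string
print(html_string[:20])  # prints the first few characters of the markup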
Write the following code in the spider.py file, inside the Spider class (it relies on urlopen from urllib.request and on the LinkFinder class, so make sure both are imported at the top of the file):
# Converts raw response data into readable information and checks that the page actually contains HTML
@staticmethod
def gather_links(page_url):  # the method that will crawl a page and return the set of links it contains
    html_string = ''  # the variable that will hold the HTML string
    try:  # the try...except statement ensures that our program won't crash if there is an exception
        response = urlopen(page_url)  # the response object returned by the server
        if 'text/html' in response.getheader('Content-Type'):  # checks if the crawled page actually contains HTML data
            html_bytes = response.read()  # reads the response body (byte data)
            html_string = html_bytes.decode("utf-8")  # converts the raw bytes into an HTML string
        finder = LinkFinder(Spider.base_url, page_url)  # creates the LinkFinder object
        finder.feed(html_string)  # parses the HTML data and collects the links
    except Exception as e:
        print(str(e))  # prints the error
        return set()  # returns an empty set
    return finder.page_links()  # returns the set of gathered links
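To see the method in action you could call it directly. The lines below are only a rough sketch: they assume the Spider class has already been initialized so that Spider.base_url is set, that the LinkFinder class from the earlier section is available, and they use python.org purely as an example URL:

links = Spider.gather_links('https://www.python.org/')  # example URL, purely illustrative
for link in links:  # prints every link that LinkFinder extracted from the page
    print(link)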