Parse HTML

In this section we start writing the code that actually collects links. We will create a new file called link_finder.py that parses the HTML of a page and finds all the links in it. To do that, we will subclass HTMLParser, a class from the html.parser module in Python's standard library. Here is the code:

from html.parser import HTMLParser
from urllib import parse  # standard functions for splitting URLs into components and joining them


class LinkFinder(HTMLParser):  # creates a new class that will inherit from HTMLParser

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

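An HTML link (an a tag) can carry several attributes, such as class and target. HTMLParser calls handle_starttag for every opening tag it encounters and passes the tag's attributes as a list of (name, value) pairs. Since we only need the href value, we can use the code below to extract it: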

    def handle_starttag(self, tag, attrs):
        if tag == 'a':  # only anchor tags are links
            for (attribute, value) in attrs:
                if attribute == 'href':  # href holds the link's destination
                    # resolves a relative link against base_url; an absolute URL is kept unchanged
                    url = parse.urljoin(self.base_url, value)
                    self.links.add(url)  # a set stores each URL only once

    def page_links(self):
        return self.links

    def error(self, message):
        # some Python versions expect subclasses to override error(); here it does nothing
        pass
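
To see the class in action, here is a minimal sketch that feeds a small HTML string to LinkFinder and prints the links it collects. The class is repeated in condensed form so that the example runs on its own; the sample HTML and the example.com URLs are made up for illustration and are not part of the crawler itself.

```python
from html.parser import HTMLParser
from urllib import parse


class LinkFinder(HTMLParser):
    """Condensed copy of the class above so this sketch runs on its own."""

    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (attribute, value) in attrs:
                if attribute == 'href':
                    self.links.add(parse.urljoin(self.base_url, value))

    def page_links(self):
        return self.links


# Hypothetical sample page, used only for illustration
html = '''
<html><body>
<a href="/about">About</a>
<a class="external" href="https://www.python.org/" target="_blank">Python</a>
<p>No link here</p>
</body></html>
'''

# Hypothetical base URL and page URL
finder = LinkFinder('https://example.com/', 'https://example.com/index.html')
finder.feed(html)  # feed() parses the text and calls handle_starttag for each opening tag
links = finder.page_links()

for link in sorted(links):
    print(link)
# prints:
# https://example.com/about
# https://www.python.org/
```

The relative link /about is joined with the base URL, while the absolute link is stored unchanged. The class and target attributes are skipped because only href is checked.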
