Parse HTML
In this section we will start writing code for the actual link collection. We will create a new file called link_finder.py that will go through an HTML file and find all the links inside of it. To do that, we will use a Python class called HTMLParser. Here is the code:
from html.parser import HTMLParser from urllib import parse # defines a standard interface to break URL strings up in components class LinkFinder(HTMLParser): # creates a new class that will inherit from HTMLParser def __init__(self, base_url, page_url): super().__init__() self.base_url = base_url self.page_url = page_url self.links = set()
HTML links can have multiple attributes, such as class, target, and such. Since we only need the href value, we can use the code below to extract that value:
def handle_starttag(self, tag, attrs): if tag == 'a': # checks if the tag is a link for (attribute, value) in attrs: if attribute == 'href': # checks if the link attribute is href, which represents a link url = parse.urljoin(self.base_url, value) # if a link is a relative link, adds the domain name to it. self.links.add(url) # adds the URL to the set def page_links(self): return self.links def error(self, message): pass