Gather links
In this section we will create a method that fetches the HTML of a page and hands it to the class that extracts the links. Because urlopen returns the page data as raw bytes rather than human-readable text, we also need to decode those bytes into a string before passing it to LinkFinder.
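For example, fetching any page directly with urlopen shows the difference between the raw bytes and the decoded string. The short sketch below is only an illustration (the python.org address is a stand-in for any URL) and is not part of spider.py:

from urllib.request import urlopen

response = urlopen('https://www.python.org/')  # example URL, used here only for illustration
html_bytes = response.read()  # the body of the response, as raw bytes (e.g. b'<!doctype html>...')
html_string = html_bytes.decode("utf-8")  # decoding turns the bytes into a human-readable string
print(html_string[:20])  # prints the first few characters of the markup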
Write the following code in the spider.py file, inside the Spider class (it relies on urlopen from urllib.request and on the LinkFinder class, so make sure both are imported at the top of the file):
# Converts raw response data into readable information and checks that the page actually contains HTML
@staticmethod
def gather_links(page_url):  # the method that will crawl a page and return the set of links it contains
    html_string = ''  # the variable that will hold the HTML string
    try:  # the try...except statement ensures that our program won't crash if there is an exception
        response = urlopen(page_url)  # the response object returned by the server
        if 'text/html' in response.getheader('Content-Type'):  # checks if the crawled page actually contains HTML data
            html_bytes = response.read()  # reads the response body (byte data)
            html_string = html_bytes.decode("utf-8")  # converts the raw bytes into an HTML string
        finder = LinkFinder(Spider.base_url, page_url)  # creates the LinkFinder object
        finder.feed(html_string)  # parses the HTML data and collects the links
    except Exception as e:
        print(str(e))  # prints the error
        return set()  # returns an empty set
    return finder.page_links()  # returns the set of gathered links
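To see the method in action you could call it directly. The lines below are only a rough sketch: they assume the Spider class has already been initialized so that Spider.base_url is set, that the LinkFinder class from the earlier section is available, and they use python.org purely as an example URL:

links = Spider.gather_links('https://www.python.org/')  # example URL, purely illustrative
for link in links:  # prints every link that LinkFinder extracted from the page
    print(link)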