Gather links

In this section we will create a function that fetches the HTML of a page and passes it to the function that extracts only the links. Because urlopen returns the response body as raw bytes rather than text, we also need to decode those bytes into a human-readable string before passing it to LinkFinder.
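
As a minimal illustration of that decoding step, the snippet below converts a bytes object into a string. The sample bytes are made up for the example and do not come from a real response.

```python
# Made-up bytes standing in for what response.read() would return
html_bytes = b'<a href="/about">About</a>'

# decode() turns the raw bytes into a regular Python string
html_string = html_bytes.decode("utf-8")

print(type(html_bytes).__name__)   # bytes
print(type(html_string).__name__)  # str
```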

 # Fetches a page, decodes the raw response bytes into a string, and returns the links found in it.
 # Requires `from urllib.request import urlopen` at the top of the module.
 @staticmethod
 def gather_links(page_url):  # crawls one page and returns the set of links found on it
     html_string = ''  # stays empty if the page is not HTML
     try:  # any failure (network error, bad encoding, ...) is caught so the crawler keeps running
         response = urlopen(page_url)  # the HTTP response; its body is raw bytes
         content_type = response.getheader('Content-Type') or ''  # the header may be missing
         if 'text/html' in content_type:  # only parse responses that declare HTML content
             html_bytes = response.read()  # reads the response body as bytes
             html_string = html_bytes.decode("utf-8")  # decodes the bytes into an HTML string
         finder = LinkFinder(Spider.base_url, page_url)  # creates the LinkFinder object
         finder.feed(html_string)  # parses the HTML and collects links
     except Exception as e:
         print(str(e))  # prints the error
         return set()  # returns an empty set so the caller has nothing to add
     return finder.page_links()  # returns the links collected by LinkFinder
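
To try the same flow without a network connection, here is a rough, self-contained sketch. It is not the tutorial's own code: SimpleLinkFinder is a hypothetical stand-in for LinkFinder, and links_from_response is a made-up helper that takes the body bytes and the Content-Type value directly instead of calling urlopen. It also decodes with errors="replace", which is a different choice from the method above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkFinder(HTMLParser):
    """Hypothetical stand-in for LinkFinder: collects absolute URLs from <a href>."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page they were found on
                    self.links.add(urljoin(self.page_url, value))


def links_from_response(body, content_type, page_url):
    """Return the set of links in body, or an empty set for non-HTML content."""
    # content_type may be None when the server omits the header
    if not content_type or "text/html" not in content_type:
        return set()
    # errors="replace" keeps going on malformed bytes instead of raising
    html_string = body.decode("utf-8", errors="replace")
    finder = SimpleLinkFinder(page_url)
    finder.feed(html_string)
    return finder.links


page = b'<p><a href="/about">About</a> <a href="https://example.com/x">X</a></p>'
print(links_from_response(page, "text/html; charset=utf-8", "https://example.com/"))
print(links_from_response(page, None, "https://example.com/"))  # set()
```

Passing the bytes and header in as arguments keeps the decoding and parsing logic separate from the network call, which makes it easier to check by hand.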

Geek University 2021