Parse domain names
In this chapter we will write two functions in a new file, domain.py. They extract the domain name of the web page we want to crawl, which lets us keep the crawler on that domain only.
The first function extracts the domain name from the URL. The second function returns the full host name, including the subdomain if one is present in the URL. Write the following code in the domain.py file:
from urllib.parse import urlparse  # imports the function used for URL parsing


def get_domain_name(url):
    try:
        # splits the host name on dots so we can return only the last two elements (such as example.com)
        results = get_sub_domain_name(url).split('.')
        return results[-2] + '.' + results[-1]  # returns only the last two elements in the list
    except Exception:
        return ''  # the host name has fewer than two parts, or the URL could not be parsed


def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except Exception:
        return ''  # ensures that at least something is returned
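To see what the second function actually returns, it can help to look at the pieces urlparse produces. The snippet below is only an illustration; the URL with a port number is a made-up example, not one used elsewhere in this chapter.

```python
from urllib.parse import urlparse

# Made-up example URL with a subdomain, a port, a path and a query string.
parts = urlparse('https://mail.geek-university.com:8080/courses?page=2')

print(parts.scheme)    # https
print(parts.netloc)    # mail.geek-university.com:8080
print(parts.hostname)  # mail.geek-university.com
print(parts.path)      # /courses
```

Note that netloc keeps the port when the URL contains one, so a function that splits netloc on dots would carry the port along in its result. If the crawler is likely to meet such URLs, using hostname instead of netloc is one possible adjustment.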
You can verify that the functions work by calling one of them, for example:

print(get_domain_name('https://mail.geek-university.com/courses'))

This prints:

geek-university.com
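As a rough sketch of how this could keep the crawler on one domain, the example below filters a list of links by comparing their domain names. The filter_links helper and the sample links are hypothetical and only serve as an illustration; get_domain_name is repeated in compact form so the snippet runs on its own.

```python
from urllib.parse import urlparse


def get_domain_name(url):
    # Same logic as in domain.py, repeated so this sketch is self-contained.
    try:
        results = urlparse(url).netloc.split('.')
        return results[-2] + '.' + results[-1]
    except Exception:
        return ''


def filter_links(links, domain_name):
    # Hypothetical helper: keep only the links whose domain matches domain_name.
    return [link for link in links if get_domain_name(link) == domain_name]


# Sample links, made up for this illustration.
links = [
    'https://mail.geek-university.com/courses',
    'https://geek-university.com/python/',
    'https://example.org/about',
]

same_site = filter_links(links, 'geek-university.com')
print(same_site)  # the first two links; the example.org link is dropped
```

Keep in mind that taking the last two parts of the host name is a simplification. For a host such as www.bbc.co.uk the function returns co.uk, so sites under two-part suffixes would need extra handling.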