Parse domain names

In this chapter we will write two functions in a new .py file. They extract the domain name of the web page we want to crawl, which lets us keep the crawler on that domain only.

The first function extracts the domain name from the URL. The second returns the full host name, including the subdomain if one is present in the URL; the first function builds on it. Write the following code in the domain.py file:

from urllib.parse import urlparse  # standard-library module for URL parsing

def get_domain_name(url):
    try:
        results = get_sub_domain_name(url).split('.')  # split the host name into its dot-separated parts
        return results[-2] + '.' + results[-1]  # keep only the last two parts (such as example.com)
    except Exception:
        return ''  # fewer than two parts, e.g. a URL without a scheme has an empty host name

def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc  # the host part of the URL, including any subdomain
    except Exception:
        return ''  # ensures that a string is always returned
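To see what get_sub_domain_name has to work with, it can help to look at the pieces urlparse produces. The following is a minimal sketch; the URL with a port number and a query string is made up for illustration and is not part of the crawler.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to show every component
parts = urlparse('https://mail.geek-university.com:8080/courses?page=2')

print(parts.scheme)    # https
print(parts.netloc)    # mail.geek-university.com:8080
print(parts.hostname)  # mail.geek-university.com
print(parts.path)      # /courses
print(parts.query)     # page=2
```

Note that netloc keeps the port number when the URL has one, so get_domain_name would return geek-university.com:8080 for this URL. If the pages you crawl use explicit ports, reading hostname instead of netloc is one way to avoid that.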

You can verify that the functions work by calling them, e.g.:

print(get_domain_name('https://mail.geek-university.com/courses'))
geek-university.com
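The reason for extracting the domain name is to keep the crawler on one site. The sketch below shows one way this could look. It repeats a compact version of get_domain_name so that it runs on its own; the is_same_domain helper and the list of links are assumptions made up for this example and are not part of domain.py.

```python
from urllib.parse import urlparse

def get_domain_name(url):
    # Same "last two parts" idea as in domain.py, repeated so this sketch is self-contained
    labels = urlparse(url).netloc.split('.')
    return '.'.join(labels[-2:]) if len(labels) >= 2 else ''

def is_same_domain(link, domain_name):
    # Hypothetical helper: True if the link belongs to the domain we are crawling
    return get_domain_name(link) == domain_name

# Made-up links, as a crawler might collect them from a page
links = [
    'https://mail.geek-university.com/courses',
    'https://geek-university.com/python',
    'https://example.com/about',
]

domain = 'geek-university.com'
to_crawl = [link for link in links if is_same_domain(link, domain)]
print(to_crawl)
```

Only the first two links are kept, because both reduce to geek-university.com. Keep in mind that taking the last two parts is a simplification: for a host such as www.example.co.uk it yields co.uk, so sites under multi-part suffixes would need a different approach.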
