Create sets

A set in Python is an unordered collection of unique, hashable objects (in practice, immutable ones such as strings, numbers, and tuples). Unlike lists and tuples, a set cannot contain the same element more than once. We will use sets in our program to store the URLs we need to crawl. Because a set holds only unique elements, adding a URL that is already present has no effect, so no URL is queued twice.
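
As a quick illustration of the uniqueness property, here is a minimal sketch. The URLs are made-up placeholders, and the variable name queue is chosen only for this example. Adding the same link twice leaves the set unchanged:

```python
# Hypothetical URLs, used only for illustration
queue = set()
queue.add('https://example.com/')
queue.add('https://example.com/about')
queue.add('https://example.com/')  # duplicate, silently ignored

print(len(queue))                            # 2
print('https://example.com/about' in queue)  # True
```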

While the crawler is running, we will keep this data in sets and only periodically save it to files. This avoids writing to the queue.txt and crawled.txt files every time a link is found, since writing to disk is much slower than updating a set in memory.
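
One way to picture this approach is the rough sketch below. The names SAVE_EVERY, save, and writes are hypothetical and not part of the crawler, and the batch size is an arbitrary assumption. The point is only that links accumulate in memory and the file is rewritten once per batch rather than once per link:

```python
import os
import tempfile

SAVE_EVERY = 3  # hypothetical batch size: save after every 3 new links

def save(links, file_name):
    # Overwrite the file with the current contents of the set, one link per line
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')

queue = set()
writes = 0  # counts how many times the file is written
file_name = os.path.join(tempfile.mkdtemp(), 'queue.txt')  # stand-in for queue.txt

for i in range(7):
    queue.add('https://example.com/page{}'.format(i))  # placeholder URLs
    if len(queue) % SAVE_EVERY == 0:
        save(queue, file_name)
        writes += 1

save(queue, file_name)  # final save so nothing held in memory is lost
writes += 1
```

In this sketch seven links are collected, but the file is written only three times.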

The first function we will add to the general.py file reads the links stored in the queue or crawled file into a set:

def file_to_set(file_name):  # read a file and add each of its lines to a set
    results = set()  # create an empty set
    with open(file_name, 'rt') as f:
        for line in f:  # loop through each line in the file
            results.add(line.replace('\n', ''))  # strip the newline added when the file was written, then add the line to the set
    return results

The code above will read a file that contains links that need to be crawled and convert them to a set.
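
A minimal usage sketch follows, assuming a temporary file stands in for queue.txt and the links are placeholders. Because the result is a set, a link that appears on two lines of the file ends up in the result only once:

```python
import os
import tempfile

def file_to_set(file_name):
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results

# Temporary stand-in for queue.txt, with one link listed twice
path = os.path.join(tempfile.mkdtemp(), 'queue.txt')
with open(path, 'w') as f:
    f.write('https://example.com/\n')
    f.write('https://example.com/about\n')
    f.write('https://example.com/\n')

links = file_to_set(path)
print(len(links))  # 2
```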

The second function does the reverse: it writes the items of a set to a file, one item per line:

# Iterate through a set and write each item to the file on its own line
def set_to_file(links, file_name):
    with open(file_name, 'w') as f:  # 'w' overwrites any existing content
        for link in sorted(links):  # sort so the file has a stable order
            f.write(link + '\n')
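
To check that the two functions fit together, the sketch below writes a set to a temporary file and reads it back. The temporary path and the placeholder URLs are assumptions made for this example. Both functions are repeated here so that the snippet runs on its own:

```python
import os
import tempfile

def file_to_set(file_name):
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results

def set_to_file(links, file_name):
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')

crawled = {'https://example.com/b', 'https://example.com/a'}  # placeholder URLs
path = os.path.join(tempfile.mkdtemp(), 'crawled.txt')        # stand-in for crawled.txt

set_to_file(crawled, path)
restored = file_to_set(path)

print(restored == crawled)  # True

with open(path) as f:
    content = f.read()  # lines appear in sorted order: .../a first, then .../b
```

Since set_to_file sorts the links before writing, the file contents are the same on every run even though the set itself has no defined order.
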
Geek University 2022