A set in Python is an unordered collection of unique objects. Its elements must be hashable, which in practice means immutable values such as strings, numbers, and tuples of such values. Unlike lists and tuples, sets can’t have multiple occurrences of the same element. We will use sets in our program to store the URLs we need to crawl. Since a set can contain only unique elements, adding a URL that is already present has no effect, so no URL is stored twice.
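As a quick illustration of that uniqueness guarantee, here is a minimal sketch. The variable name `queue` and the example.com URLs are placeholders chosen for this example, not part of the crawler's code.

```python
# A set silently ignores duplicates, so a URL can only be stored once.
queue = set()
queue.add("https://example.com/")
queue.add("https://example.com/about")
queue.add("https://example.com/")  # already present, so nothing changes

print(len(queue))                            # 2
print("https://example.com/about" in queue)  # True
```

Membership checks with `in` are also fast on a set, which is useful when the crawler has to decide whether a link has already been seen.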
While the crawler is running, we will keep this data in sets and only periodically save it to files. This avoids writing to the queue.txt and crawled.txt files on every change, since writing to a file is much slower than updating a set in memory.
The first function we will add to the general.py file will convert links in a queue or crawled file to a set:
def file_to_set(file_name):  # the function that will read a file and convert each line to set items
    results = set()  # creates an empty set
    with open(file_name, 'rt') as f:
        for line in f:  # loop through each line in a file
            # add a line to a set and remove a newline character we've added earlier
            results.add(line.replace('\n', ''))
    return results
The code above reads a file that contains one link per line, such as the list of links that still need to be crawled, and returns those links as a set.
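To see how reading a file into a set behaves, the sketch below repeats the function and runs it on a temporary file. The temporary directory, the file contents, and the example.com links are assumptions made up for this demonstration.

```python
import os
import tempfile


def file_to_set(file_name):
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results


# Hypothetical queue file in which one link appears twice.
with tempfile.TemporaryDirectory() as tmp_dir:
    queue_file = os.path.join(tmp_dir, 'queue.txt')
    with open(queue_file, 'w') as f:
        f.write('https://example.com/a\n'
                'https://example.com/b\n'
                'https://example.com/a\n')
    queue = file_to_set(queue_file)

print(sorted(queue))  # ['https://example.com/a', 'https://example.com/b']
```

The duplicated line collapses into a single set item, and the trailing newline characters are gone.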
The second function will write the items in a set to a file, one item per line:
# Iterate through a set, each item will be a line in a file
def set_to_file(links, file_name):
    with open(file_name, "w") as f:
        for link in sorted(links):
            f.write(link + "\n")
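A short sketch of what the saved file looks like, again using a temporary directory and made-up example.com links as stand-ins for real crawler data:

```python
import os
import tempfile


def set_to_file(links, file_name):
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')


# Hypothetical set of crawled links, in no particular order.
crawled = {'https://example.com/b', 'https://example.com/a'}

with tempfile.TemporaryDirectory() as tmp_dir:
    crawled_file = os.path.join(tmp_dir, 'crawled.txt')
    set_to_file(crawled, crawled_file)
    with open(crawled_file) as f:
        contents = f.read()

print(contents)
```

Because the function sorts the links before writing them, the file comes out in the same order on every save, even though the set itself has no order. Opening the file in "w" mode also means each save replaces the previous contents rather than appending to them.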