Run the program

To run the program, set the PROJECT_NAME and HOMEPAGE variables in the main.py file. For example, to crawl https://geek-university.com, we would enter:

PROJECT_NAME = 'GeekUniversity'
HOMEPAGE = 'https://geek-university.com'

Now run main.py. In our case, the following messages will be displayed:

Creating project GeekUniversity
First spider now crawling https://geek-university.com
Queue 1 | Crawled 0
26 links in the queue
Thread-1 now crawling https://geek-university.com/wp-login.php?action=lostpassword&redirect_to=https://geek-university.com/
Queue 26 | Crawled 1
Thread-2 now crawling https://geek-university.com/company/
Queue 26 | Crawled 1
Thread-1 now crawling https://geek-university.com/course/mysql-course/
Queue 26 | Crawled 2

The GeekUniversity folder is created and contains two files, queue.txt and crawled.txt. The queue.txt file lists the links that were found and are waiting to be crawled, and the crawled.txt file lists the links that have already been crawled.
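To check the crawler's progress without opening the files by hand, the two files can be counted with a few lines of Python. This is only a minimal sketch: it assumes each file stores one link per line, and the helper names read_links and summarize are made up for this example and are not part of the crawler.

```python
from pathlib import Path


def read_links(path):
    """Return the non-empty lines of a link file as a set (empty set if the file is missing)."""
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    # Assumption: one link per line; blank lines are skipped.
    return {line.strip() for line in lines if line.strip()}


def summarize(project_dir):
    """Count the links stored in queue.txt and crawled.txt inside project_dir."""
    project = Path(project_dir)
    queued = read_links(project / "queue.txt")
    crawled = read_links(project / "crawled.txt")
    return {"queued": len(queued), "crawled": len(crawled)}


if __name__ == "__main__":
    # Folder name taken from PROJECT_NAME in the example above.
    print(summarize("GeekUniversity"))
```

Running it while the crawler is working should show roughly the same numbers as the "Queue | Crawled" messages, although the counts can differ slightly depending on when the files were last written.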

Be careful when running the crawler against a website, since specifying too many threads sends many requests at the same time and can significantly affect the web server's performance.
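One way to keep the load low is to use only a few threads and to pause after every request. The sketch below illustrates the idea and is not the crawler's actual code: the names NUMBER_OF_THREADS, DELAY_SECONDS and crawl_politely are hypothetical, and the fetch argument stands in for whatever function downloads a page.

```python
import threading
import time
from queue import Empty, Queue

NUMBER_OF_THREADS = 2   # hypothetical setting: keep the thread count small
DELAY_SECONDS = 0.5     # hypothetical pause after each request, per thread


def crawl_politely(urls, fetch, threads=NUMBER_OF_THREADS, delay=DELAY_SECONDS):
    """Fetch every URL with a fixed number of threads, pausing after each request."""
    jobs = Queue()
    for url in urls:
        jobs.put(url)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = jobs.get_nowait()   # stop when no work is left
            except Empty:
                return
            page = fetch(url)
            with lock:                    # protect the shared results list
                results.append((url, page))
            time.sleep(delay)             # give the server a break

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

For example, crawl_politely(links, fetch=download_page) would process the links with two threads and at most about four requests per second in total, where download_page is a placeholder for your own download function. Lowering the thread count or raising the delay makes the crawl slower but gentler on the server.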

