Run the program
To run the program, set the PROJECT_NAME and HOMEPAGE variables in the main.py file. For example, to crawl https://geek-university.com, we would enter this:
PROJECT_NAME = 'GeekUniversity'
HOMEPAGE = 'https://geek-university.com'
Now run main.py. In our case, the following messages will be displayed:
Creating project GeekUniversity
First spider now crawling https://geek-university.com
Queue 1 | Crawled 0
26 links in the queue
Thread-1 now crawling https://geek-university.com/wp-login.php?action=lostpassword&redirect_to=https://geek-university.com/
Queue 26 | Crawled 1
Thread-2 now crawling https://geek-university.com/company/
Queue 26 | Crawled 1
Thread-1 now crawling https://geek-university.com/course/mysql-course/
Queue 26 | Crawled 2
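The program creates a folder named GeekUniversity that contains two files, queue.txt and crawled.txt. The queue.txt file lists the links that have been found and are waiting to be crawled, and the crawled.txt file lists the links that have already been crawled.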
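To check on progress without opening the files by hand, a small helper script can count the entries in each file. The sketch below is not part of the crawler itself. It assumes that both files store one link per line, and the function names read_links and summarize are made up for this example.

```python
from pathlib import Path


def read_links(path):
    """Return the non-empty lines of a link file as a set of URLs.

    Assumes one link per line; a missing file is treated as empty.
    """
    file = Path(path)
    if not file.exists():
        return set()
    with file.open(encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def summarize(project_dir):
    """Count the queued and crawled links in a project folder."""
    root = Path(project_dir)
    queued = read_links(root / "queue.txt")
    crawled = read_links(root / "crawled.txt")
    return {"queued": len(queued), "crawled": len(crawled)}


if __name__ == "__main__":
    # 'GeekUniversity' is the PROJECT_NAME used in the example above.
    print(summarize("GeekUniversity"))
```

Run the script from the directory that contains the GeekUniversity folder. If the crawler is still running, the numbers will change between runs.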
Be careful when running the crawler against a website: too many threads send many requests at the same time, which can significantly degrade the web server's performance.
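One way to keep the load low is to cap the number of worker threads and pause briefly after each request. The following is a minimal sketch of that idea, not the code from main.py. The names MAX_THREADS, DELAY_SECONDS and crawl_politely are hypothetical, and the handler argument stands in for whatever function actually fetches a page. Unlike the real crawler, this sketch works on a fixed list of links and does not add newly found links to the queue.

```python
import threading
import time
from queue import Empty, Queue

MAX_THREADS = 2      # hypothetical cap; not a name taken from main.py
DELAY_SECONDS = 0.5  # hypothetical pause after each request, per thread


def crawl_politely(urls, handler, max_threads=MAX_THREADS, delay=DELAY_SECONDS):
    """Call handler(url) for every URL using at most max_threads workers."""
    pending = Queue()
    for url in urls:
        pending.put(url)

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except Empty:
                return  # nothing left to do, let the thread finish
            try:
                handler(url)
            except Exception as exc:
                # Report the failure and keep going with the next URL.
                print(f"Failed on {url}: {exc}")
            time.sleep(delay)  # give the server a short break

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every URL has been handled


if __name__ == "__main__":
    crawl_politely(
        ["https://geek-university.com/company/"],
        handler=lambda url: print("Would crawl", url),
    )
```

A smaller max_threads value or a longer delay makes the crawl slower but gentler on the server.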