Whenever a Python debate springs up among web developers and tech enthusiasts, one topic that is mentioned more often than not is web scraping.
It must be said that web scraping can be done in a whole host of ways; in truth, there is no hard and fast route to it. What's more, there is no single “ideal” method for web scraping either, and this is because of its versatility.
The DIY approach to scraping may be flexible, but it is not ideal if you want to extract a huge amount of data, because with a synchronous library such as requests each page blocks until it has been downloaded: the more requests there are, the longer the job takes.
This being said, there is another viable option besides requests: the asyncio event loop combined with the aiohttp library. It can be used to write a small scraper quickly, and you can learn how to use it in this tutorial.
This tutorial begins with a basic explanation of what asyncio is and then shows how it can be used to write a scraper.
So What's asyncio?
Basically, asyncio is the asynchronous I/O library that was added to the standard library in Python 3.4. Note that asyncio is also available for Python 3.3 from PyPI.
You can use asyncio to write asynchronous code. However, before you do, there are some basics you should be aware of, as explained below.
The first thing you need to be aware of is the coroutine. Coroutines are similar to functions, but they can be paused and resumed at different points during the execution of asynchronous code.
Another thing you need to be aware of when writing asynchronous code with asyncio is the event loop.
It is the event loop that drives the execution of coroutines: it schedules them and resumes each one when the operation it is waiting on completes.
aiohttp is best described as an HTTP library built to work with asyncio. You should also know that the API of aiohttp is similar to that of requests.
Using aiohttp to write asynchronous code
You can write a coroutine that retrieves and then prints out a page. To do this, use the asyncio.coroutine decorator on a function, making it act as a coroutine. (On Python 3.5 and later, the async def syntax supersedes this decorator.)
Note that aiohttp.request, for example, is a coroutine, and so is the response's read method. You call them by using “yield from”. Other than this, the code reads almost like its synchronous counterpart, as in the example below.
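The pattern can be sketched as follows. Since the original snippets used the pre-3.5 @asyncio.coroutine and “yield from” style, the modern async/await equivalent is shown here, and the aiohttp network call is replaced by an asyncio.sleep stand-in so the sketch runs offline; the URL is purely illustrative.

```python
import asyncio

async def fetch(url):
    # With aiohttp installed, this body would be roughly:
    #   async with aiohttp.request('GET', url) as resp:
    #       return await resp.read()
    await asyncio.sleep(0.1)          # stand-in for the network round-trip
    return b"<html>fake body for %s</html>" % url.encode()

async def print_page(url):
    # One coroutine calling another: await pauses here until fetch finishes.
    body = await fetch(url)
    print(body.decode())

asyncio.run(print_page("http://example.com"))
```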
Using “yield from”, you can call one coroutine from another. To call a coroutine from synchronous code, however, you need an event loop.
With asyncio.get_event_loop() you can obtain the default event loop, and a coroutine can be run on it with the run_until_complete() method. To run the previous coroutine, you can do it this way:
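A minimal sketch of driving a coroutine from synchronous code; the greet coroutine is an illustrative stand-in. (Recent Pythons deprecate calling asyncio.get_event_loop() outside a running loop, so asyncio.new_event_loop() is used here; asyncio.run() is the even shorter modern spelling.)

```python
import asyncio

async def greet():
    await asyncio.sleep(0.01)   # pretend to do some asynchronous work
    return "hello"

# Obtain an event loop and drive the coroutine to completion on it.
loop = asyncio.new_event_loop()
result = loop.run_until_complete(greet())
loop.close()
print(result)  # hello
```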
Another important function you should know about is asyncio.wait. This function takes a number of coroutines and bundles them into a single coroutine that completes when all of them have completed, as seen below.
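A sketch of the pattern, with simulated fetches standing in for real aiohttp requests. Note that newer Pythons require asyncio.wait to be given Task objects rather than bare coroutines, hence the ensure_future calls.

```python
import asyncio

async def fetch(url):
    await asyncio.sleep(0.01)          # simulated request
    return url

async def main():
    urls = ["http://a.example", "http://b.example", "http://c.example"]
    # Wrap each coroutine in a Task, then wait for the whole batch at once.
    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    done, pending = await asyncio.wait(tasks)
    # done is an unordered set, so sort the results for display.
    return sorted(t.result() for t in done)

print(asyncio.run(main()))
```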
The asyncio.as_completed function likewise takes a number of coroutines, but returns an iterator that yields them in their order of completion. This enables you to get each result as soon as it is available.
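A sketch of as_completed, again with asyncio.sleep standing in for real requests; the deliberately different delays make the completion order observable.

```python
import asyncio

async def fetch(url, delay):
    await asyncio.sleep(delay)   # simulated request of varying duration
    return url

async def main():
    jobs = [fetch("http://slow.example", 0.05),
            fetch("http://fast.example", 0.01)]
    finished = []
    # as_completed yields awaitables in the order the jobs finish,
    # so the fastest result is available first.
    for fut in asyncio.as_completed(jobs):
        finished.append(await fut)
    return finished

print(asyncio.run(main()))  # the fast URL comes out first
```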
Writing a Scraper
With an understanding of how asynchronous HTTP requests are done, you can now attempt to write a scraper. However, you will also need a library for reading HTML conveniently. There are several out there, such as lxml, pyquery, and beautifulsoup.
For the purposes of this tutorial, beautifulsoup is the HTML parser used.
Writing a small scraper
The scraper in this example retrieves the torrent magnet links for several Linux distributions from a Pirate Bay search page.
You first start by writing a helper coroutine that carries out GET requests, as seen below.
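A hedged stand-in for that helper: in the real scraper, get would wrap aiohttp (roughly, async with aiohttp.request('GET', url) as resp: return await resp.text()). Here the request is simulated with a lookup table of fake pages so the snippet runs offline; the URL is illustrative.

```python
import asyncio

# Fake pages standing in for real HTTP responses (illustrative URL).
FAKE_PAGES = {
    "http://example.invalid/search/ubuntu":
        '<a href="magnet:?xt=urn:btih:UBUNTUHASH">magnet</a>',
}

async def get(url):
    # Stand-in for the aiohttp round-trip described in the lead-in.
    await asyncio.sleep(0.01)
    return FAKE_PAGES.get(url, "")

print(asyncio.run(get("http://example.invalid/search/ubuntu")))
```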
Next, parsing follows, with a first_magnet helper that extracts the first magnet link from the page, as seen below.
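A sketch of such a helper. The article parses pages with beautifulsoup; to keep this snippet dependency-free, a regular expression is used instead, which is enough to illustrate the idea.

```python
import re

def first_magnet(page_html):
    # Return the first magnet link found in the page, or None if there
    # is none. (The real scraper would do this with beautifulsoup.)
    match = re.search(r'href="(magnet:[^"]+)"', page_html)
    return match.group(1) if match else None

page = '<a href="magnet:?xt=urn:btih:ABC123">magnet</a>'
print(first_magnet(page))  # magnet:?xt=urn:btih:ABC123
```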
After this comes the main coroutine. Note that the URL used ensures the results are sorted by the number of seeders, so the first result is invariably the one with the greatest number of seeders.
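A hedged sketch of that per-distribution coroutine. The search URL pattern, the fake pages, and the regex-based parsing are illustrative stand-ins for the real site, aiohttp, and beautifulsoup.

```python
import asyncio
import re

# Hypothetical search URL; the query string is assumed to request
# seeders-descending ordering, as the article describes.
SEARCH_URL = "http://example.invalid/search/{}/seeders"

FAKE_PAGES = {
    SEARCH_URL.format("debian"):
        '<a href="magnet:?xt=urn:btih:TOPSEEDED">m</a>'
        '<a href="magnet:?xt=urn:btih:SECOND">m</a>',
}

async def get(url):
    await asyncio.sleep(0.01)                 # simulated HTTP GET
    return FAKE_PAGES.get(url, "")

async def query(distro):
    page = await get(SEARCH_URL.format(distro))
    match = re.search(r'href="(magnet:[^"]+)"', page)
    # First link on a seeders-sorted page = most-seeded torrent.
    return distro, match.group(1) if match else None

print(asyncio.run(query("debian")))
```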
Lastly, here is the code that calls all of the above.
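A sketch of that driver code, using asyncio.gather to run one query per distribution concurrently. The query stand-in below replaces the real scraping coroutine so the snippet runs on its own.

```python
import asyncio

async def query(distro):
    # Stand-in for the real per-distro scraping coroutine.
    await asyncio.sleep(0.01)
    return distro, "magnet:?xt=urn:btih:%s" % distro.upper()

async def main():
    distros = ["ubuntu", "debian", "fedora"]
    # gather schedules all queries at once; results come back in
    # the same order as the inputs.
    results = await asyncio.gather(*(query(d) for d in distros))
    for distro, magnet in results:
        print(distro, magnet)
    return results

asyncio.run(main())
```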
Finally, you have now created a small scraper that works asynchronously, which means several pages are downloaded simultaneously. In the example provided, the scraper runs about three times as fast as equivalent code using requests. Using this example for guidance, you should be able to write your own scraper with ease.
asyncio offers a lot of very interesting features, and its documentation is the place to learn more about them.
While this is a nice trick for making several requests simultaneously, you should be wary of flooding the server. Avoid making so many concurrent requests that you strain your own connection, or worse, get banned from the website.
One way to keep your requests contained is to use a semaphore. A semaphore is a synchronization primitive that constrains coroutines: it ensures that only a certain number of coroutines carrying out a particular task can run at once, which in turn limits how many requests can be made at the same time.
In the example below, the semaphore is created up front, before the event loop starts running; it is what will limit the number of requests made at the same time.
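A minimal sketch of that first step, assuming a limit of five. (On modern Pythons a Semaphore binds to a loop only when first awaited, so creating it at module level like this is safe.)

```python
import asyncio

# Five permits: at most five coroutines will hold the semaphore at once.
sem = asyncio.Semaphore(5)

print(sem.locked())   # False: all five permits are still free
```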
Next, the semaphore is used in the fetching coroutine itself: each request acquires the semaphore before running and releases it when done, which is what caps the number of simultaneous requests.
In this example, a maximum of five (5) requests is permitted at once.
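A runnable sketch of the guarded fetch, with a counter added to show that concurrency really stays at or below five; the sleep stands in for the HTTP request, and the URLs are illustrative.

```python
import asyncio

peak = 0      # highest number of fetches observed in flight at once
active = 0

async def get(sem, url):
    global active, peak
    # "async with sem" waits while five other fetches hold a permit.
    async with sem:
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)        # simulated HTTP request
        active -= 1
        return url

async def main():
    sem = asyncio.Semaphore(5)           # at most five concurrent requests
    urls = ["http://example.invalid/%d" % i for i in range(20)]
    return await asyncio.gather(*(get(sem, u) for u in urls))

asyncio.run(main())
print("peak concurrency:", peak)         # never exceeds 5
```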
Progress Bar Tip
Another tip that would be useful to you is tqdm, a library for creating progress bars. Used in a coroutine that functions much like asyncio.wait, it has the added advantage of displaying a progress bar that shows how many of the running coroutines have completed.
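The pattern can be sketched like this: wrap asyncio.as_completed with tqdm so the bar ticks each time a coroutine finishes. A no-op fallback is included so the snippet also runs without tqdm installed, and the fetches are simulated with asyncio.sleep.

```python
import asyncio

try:
    from tqdm import tqdm            # real progress bars if tqdm is installed
except ImportError:                  # otherwise fall back to a no-op wrapper
    def tqdm(iterable, **kwargs):
        return iterable

async def fetch(url):
    await asyncio.sleep(0.01)        # simulated request
    return url

async def main():
    jobs = [fetch("http://example.invalid/%d" % i) for i in range(10)]
    results = []
    # as_completed yields one awaitable per finished job, so the bar
    # advances each time a coroutine completes.
    for fut in tqdm(asyncio.as_completed(jobs), total=len(jobs)):
        results.append(await fut)
    return results

print(len(asyncio.run(main())), "pages fetched")
```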