If you want to extract as much data as possible from websites, consider a technique that is quick and automated.
One tried and tested automated method for extracting targeted data from a website is known as web scraping.
Web scraping is a very useful data extraction technique that can help you get a great deal of work done in far less time.
This article teaches the fundamentals of web scraping. In this tutorial you will learn how to avoid detection while web scraping, what you should and should not do, and how to speed up the scraping process.
There are also use cases in this tutorial that give you practical examples of why, when and where web scraping can be helpful. All of this, along with Python snippets and packages, is provided with the aim of preparing you for a quick start to your web scraping adventure.
So without further ado, let's begin!
Typical Web Scraping Use Cases
In truth, there are many reasons why you might scrape the web. Some are as mundane as scraping the pages of an online store to get the prices of grocery items and comparing them with those of a competing store.
You might do this to find out which of the two stores offers the best prices on groceries, so that you can start patronising the one with the lowest price.
You might also scrape an airline's website to get the prices of flight tickets at different times of the day.
That would arm you with data on how ticket prices vary, which helps you time your purchase: you could buy when the price drops or when tickets are sold at a competitive rate.
Whatever your reasons for web scraping, you need to equip yourself with the necessary knowledge, which is what this tutorial provides.
Tools of the Trade
In case you think this tutorial is all about promoting and selling a product, it's not! The beauty of web scraping is that there are different ways to go about it.
Every website is unique, in the sense that two similar sites may store their data differently. So if you are going to scrape a site for its data, you have to study its structure first and foremost.
Once you have a clear understanding of the site's structure, you can decide whether to build your own web scraping solution or use an existing scraping package or tool.
The good news is that there are quite a number of web scraping packages and tools you can use. It all depends on your purpose for web scraping and the level of your programming skills.
Inspect, inspect, inspect...
In web scraping you will spend a significant amount of time inspecting a site's HTML. Make use of your browser's inspection feature.
In the image above, you can see the "hero". This is the area of the website that holds the site owner's name, description and avatar (hero--profile u-flexTOP).
The owner's name is held in an <h1> element with the classes ui-h2 and hero-title, while the owner's description is held in a <p> element with the classes ui-body and hero-description.
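As a rough sketch of how inspected class names translate into code, the snippet below parses a small hand-written HTML fragment with BeautifulSoup. The fragment is an assumption reconstructed from the class names mentioned above, and the owner name and description in it are made up; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reconstructed from the class names seen in the inspector.
SAMPLE_HTML = """
<div class="hero--profile u-flexTOP">
  <h1 class="ui-h2 hero-title">Jane Doe</h1>
  <p class="ui-body hero-description">Writes about data and Python.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A CSS selector such as "h1.a.b" matches an <h1> carrying both classes.
title = soup.select_one("h1.ui-h2.hero-title")
description = soup.select_one("p.ui-body.hero-description")

# select_one() returns None when nothing matches, so guard before reading text.
owner_name = title.get_text(strip=True) if title else None
owner_description = description.get_text(strip=True) if description else None
print(owner_name, "-", owner_description)
```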
Have you heard of Scrapy?
Scrapy is a web scraping package that you can use to crawl sites and extract data from their HTML. It is flexible and can be customised to suit your scraping needs. Scrapy also comes with many features that enhance the scraping experience: it has built-in logging and can export data in a variety of formats.
You can also use it to extract target data through an API. In addition, Scrapy lets you disable cookies and run several spiders across a number of processes. While Scrapy is a very useful web scraping package, it may be a bit too much for programming beginners.
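For illustration, several of the features mentioned above map onto entries in a Scrapy project's settings.py. The values below are placeholder assumptions rather than recommendations, and the output file name is hypothetical.

```python
# settings.py (sketch): plain module-level assignments that Scrapy reads.

LOG_LEVEL = "INFO"        # logging verbosity
COOKIES_ENABLED = False   # disable cookie handling
ROBOTSTXT_OBEY = True     # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0      # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 8   # upper bound on simultaneous requests

# Export scraped items to a file; "prices.json" is a hypothetical name.
FEEDS = {
    "prices.json": {"format": "json"},
}
```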
BeautifulSoup and the Requests Library
If you want to parse a site's HTML source code smoothly, one library that can help is BeautifulSoup.
BeautifulSoup only parses HTML, so you pair it with an HTTP library such as Requests to retrieve the content of a URL. The difference between Scrapy and BeautifulSoup is that the former takes care of many aspects of scraping automatically, while the latter requires a bit more manual work. That suits programmers with an above-average skill set.
What is Basic Code?
As mentioned earlier, you need to inspect the target website's HTML in order to find the classes and ids of the elements you want.
In the use case below, you have an HTML structure from which the main_price data of an item is targeted for extraction.
In this use case the discounted_price data is regarded as optional.
<div class="main_price">Price: $50.25</div>
<div class="discounted_price">Discounted price: $45.35</div>
<div class="main_price">Price: $48.50</div>
When it comes to the basic code, you need to:
- import the libraries,
- execute the request,
- parse the HTML source code, and
- locate the class main_price.
There are situations where the class main_price also appears elsewhere on the page. To make sure that only the targeted main_price elements are extracted and not other data, first narrow the search to the element with the id listings_prices.
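The four steps, plus the narrowing by id, can be sketched as follows with Requests and BeautifulSoup. This is a minimal sketch, not a definitive implementation: the wrapper div with id listings_prices and the unrelated footer price are assumptions added so that the demo runs offline, and fetch_html and extract_main_prices are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Execute the request and return the page's HTML source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_main_prices(html):
    """Parse the HTML and return the text of each targeted main_price element."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the listings container first (assumed id).
    container = soup.find(id="listings_prices")
    if container is None:
        return []
    return [tag.get_text(strip=True)
            for tag in container.find_all("div", class_="main_price")]


# Offline demo: the wrapper div and the footer div are assumptions.
SAMPLE_HTML = """
<div id="listings_prices">
  <div class="main_price">Price: $50.25</div>
  <div class="discounted_price">Discounted price: $45.35</div>
  <div class="main_price">Price: $48.50</div>
</div>
<div class="main_price">Price: $1.00 (unrelated footer offer)</div>
"""
prices = extract_main_prices(SAMPLE_HTML)
print(prices)
```

To scrape a live page, you would pass the result of fetch_html(url) to extract_main_prices, with a URL of your own.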
Be wary of the following drawbacks
#1: HTML Issues
HTML tags can carry a class, an id, or both. An id identifies one specific element, while a class is less specific and can be shared by many elements. If an element or class name changes, your code may return inaccurate results or break.
To reduce this risk, target a distinct id rather than a class wherever possible, because the id is more specific and the class is more likely to change.
Also, check whether a lookup returned None before you use the element. Be aware that some fields are optional, such as the discounted_price in the HTML use case shown earlier, so the corresponding element does not necessarily show up on every listing. For such a field you can count the number of times the lookup returned None against the total number of listings.
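A minimal sketch of the None check and the count of missing optional fields might look like this. It assumes each listing sits in its own wrapper div with the class listing, which is not shown in the original HTML and is introduced here only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each listing is assumed to have its own wrapper div.
SAMPLE_HTML = """
<div class="listing">
  <div class="main_price">Price: $50.25</div>
  <div class="discounted_price">Discounted price: $45.35</div>
</div>
<div class="listing">
  <div class="main_price">Price: $48.50</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = soup.find_all("div", class_="listing")

rows = []
missing_discount = 0
for listing in listings:
    main = listing.find("div", class_="main_price")
    discount = listing.find("div", class_="discounted_price")
    if discount is None:  # find() returns None when nothing matches
        missing_discount += 1
    rows.append({
        "main_price": main.get_text(strip=True) if main is not None else None,
        "discounted_price": (
            discount.get_text(strip=True) if discount is not None else None
        ),
    })

print(f"{missing_discount} of {len(listings)} listings have no discounted price")
```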
#2: Observe the web scraping protocol
A website's robots.txt file sets out the site's rules for automated access. To find it, append /robots.txt to the site's main domain. For example, www.mywebste_to_scrape.com/robots.txt.
The file indicates which sections of the target website you are not allowed to extract data from automatically. It may also indicate the rate at which a bot is permitted to request pages.
While it is quite understandable that many people tend to overlook the website rules, taking a look at them before web scraping could give you a fair idea about what to expect from the website.
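Python's standard library can read these rules for you. The sketch below parses a made-up robots.txt with urllib.robotparser; the file content, the example.com URLs and the agent name my-scraper are all assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

AGENT = "my-scraper"  # hypothetical user agent name
allowed_public = parser.can_fetch(AGENT, "https://www.example.com/listings")
allowed_private = parser.can_fetch(AGENT, "https://www.example.com/private/data")
delay = parser.crawl_delay(AGENT)  # seconds between requests, or None if unset

print(allowed_public, allowed_private, delay)
```

For a live site, you would call set_url() with the robots.txt address and then read() instead of parse(), so that the parser downloads the file itself.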
#3: Set the request timeout parameter
It is important to set the timeout parameter because, by default, Requests waits for a response with no time limit. If you do not set a timeout, your scraper could wait for a response indefinitely.
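With the Requests library, a timeout is passed per call. The following is a minimal sketch; the timeout values are illustrative assumptions, and fetch is a hypothetical helper name.

```python
import requests

# (connect timeout, read timeout) in seconds; the values are illustrative.
DEFAULT_TIMEOUT = (3.05, 10)


def fetch(url, timeout=DEFAULT_TIMEOUT):
    """Return the page text, or None if the server is too slow to answer."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect and read timeouts.
        return None
    response.raise_for_status()
    return response.text
```

Returning None on a timeout is one possible design; you might prefer to retry or to log the URL for a later attempt.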
#4: One User agent for multiple requests is a dead giveaway
With each visit to a website, information about your browser is sent through the user agent. As a matter of fact, some websites will not show you their content if no user agent is provided. Websites do not want to block every visitor, especially genuine ones, but if you use a single user agent to send as many as 200 requests per second, you may eventually be blocked.
To avoid this, you can set a user agent manually or, alternatively, pick user agents at random.
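One simple way to vary the user agent is to pick one at random from a list for each request. The strings below are illustrative assumptions and should be replaced with current, realistic ones; random_headers is a hypothetical helper name.

```python
import random

# Illustrative user agent strings; replace them with current, realistic ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

The resulting dictionary can be passed to Requests as requests.get(url, headers=random_headers(), timeout=10).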
#5: The IP Address Issue
While you can alter your user agent at random, the requests you make still come from one IP address.
Ordinarily this is not a problem, but when the number of requests coming from a single IP address is overwhelming, a server can easily identify that address.
To vary your IP address, you can route your requests through a proxy service such as Tor or a VPN. If you do, you can operate as a largely unknown user.
When you use a proxy such as Tor or a VPN, your target website does not see your own IP address. Instead, it sees the IP address of the proxy server you chose.
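With Requests, routing traffic through a proxy comes down to passing a proxies mapping. The sketch below assumes an HTTP proxy listening at a hypothetical local address; build_proxies and fetch_via_proxy are hypothetical helper names.

```python
import requests

# Hypothetical proxy endpoint; substitute one you are permitted to use.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy endpoint."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxy_url=PROXY_URL, timeout=10):
    """Request the page through the proxy, so the site sees the proxy's IP."""
    response = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    response.raise_for_status()
    return response.text


proxies = build_proxies(PROXY_URL)
print(proxies)
```

Tor exposes a SOCKS proxy rather than an HTTP one, so using it this way would additionally need Requests installed with its optional SOCKS support and a socks5 proxy URL.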
#6: You could be blocked
You may have been blocked if you keep receiving the following status codes:
- 403 (Forbidden)
- 404 (Not Found)
- 408 (Request Timeout)
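One rough way to act on this is to track the status codes of recent responses and pause when the suspect ones dominate. The 50% threshold below is an arbitrary assumption, and looks_blocked is a hypothetical helper name.

```python
from collections import Counter

# Status codes the list above associates with being blocked.
SUSPECT_CODES = {403, 404, 408}


def looks_blocked(recent_status_codes, threshold=0.5):
    """Return True when suspect codes reach `threshold` of recent responses."""
    if not recent_status_codes:
        return False
    counts = Counter(recent_status_codes)
    suspect = sum(counts[code] for code in SUSPECT_CODES)
    return suspect / len(recent_status_codes) >= threshold


mostly_fine = looks_blocked([200, 200, 403, 200])
mostly_refused = looks_blocked([403, 403, 408, 200])
print(mostly_fine, mostly_refused)
```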
#7: The Honeypots Conundrum
One challenge you are likely to face when web scraping is the honeypot. Websites use honeypots to spot and flag scrapers and crawlers. A honeypot may take the form of a hidden link: you cannot see it in the browser, but your scraper can still extract and follow it without you knowing.
These links are usually hidden with a CSS rule such as display: none. As soon as your scraper or crawler follows this type of link, you can expect your IP address to be flagged, or you may be blocked straight away.
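A partial defence is to skip links that are hidden by an inline style, either on the link itself or on one of its ancestors. The sketch below does this with BeautifulSoup on a made-up HTML fragment. It is only a heuristic: links hidden through an external stylesheet would not be detected. The helper name visible_links is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Matches inline styles that hide an element from human visitors.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE
)


def visible_links(html):
    """Return hrefs of links that are not hidden by an inline style."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Check the link itself and every ancestor for a hiding style.
        chain = [anchor, *anchor.parents]
        if any(HIDDEN_STYLE.search(tag.get("style", "") or "") for tag in chain):
            continue
        links.append(anchor["href"])
    return links


# Hypothetical page fragment with two hidden trap links.
SAMPLE_HTML = """
<a href="/products">Products</a>
<a href="/trap" style="display:none">Do not follow</a>
<div style="display: none"><a href="/trap2">Hidden too</a></div>
"""
safe_links = visible_links(SAMPLE_HTML)
print(safe_links)
```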
What you SHOULD/should NOT do before and during Scraping
- Check whether a public API is available. With a public API, you can get the data more quickly and reliably than by scraping.
- If you are scraping large amounts of data, use a database so that you can store and analyze it quickly.
- Avoid overloading your target website with a massive number of requests per second.
- Be courteous when scraping. You can inform the owners of your target website that you intend to scrape their site, so that they know what to expect and can respond appropriately.
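As an example of the database advice, scraped rows can be stored in SQLite and then queried directly. This sketch uses an in-memory database and made-up grocery prices from two hypothetical stores.

```python
import sqlite3

# In-memory database for the demo; pass a file path to keep the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (store TEXT, item TEXT, price REAL)")

# Hypothetical scraped rows.
scraped = [
    ("store_a", "milk", 1.20),
    ("store_b", "milk", 1.05),
    ("store_a", "bread", 2.50),
    ("store_b", "bread", 2.75),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", scraped)
conn.commit()

# Lowest price found for each item across both stores.
cheapest = conn.execute(
    "SELECT item, MIN(price) FROM prices GROUP BY item ORDER BY item"
).fetchall()
print(cheapest)
conn.close()
```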
Scraping with Speed but with Care
You may decide to boost your scraping speed through "parallelization", that is, by sending several requests at the same time. If you do, take care not to overload the server.
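One way to parallelize with care is to cap both the number of concurrent workers and the rate at which new requests are started. The sketch below uses a thread pool from the standard library. The fake_fetch function is a stand-in so that the example runs offline, and the worker count and interval are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, max_workers=4, min_interval=0.5):
    """Fetch urls in parallel while capping concurrency and request rate."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[pool.submit(fetch, url)] = url
            time.sleep(min_interval)  # space out submissions to limit the rate
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    time.sleep(0.01)
    return f"content of {url}"


urls = [f"https://www.example.com/page/{n}" for n in range(5)]
pages = scrape_all(urls, fake_fetch, max_workers=2, min_interval=0.01)
print(len(pages))
```

In real use you would pass a function that performs the HTTP request and choose an interval that respects the site's limits.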
Finally, if you are extracting a large volume of data and you pre-process the retrieved data as you go, the number of requests per second that you send to the target website will be comparatively low.