This tutorial teaches you how to scrape the web using Python. Once you have worked through it, you should be able to accomplish the task in just under 5 minutes.
But first, the basics of web scraping are explained. So if you are new to web scraping, or you never knew Python could be used for it, this tutorial is for you.
So what is Web Scraping?
If you want to extract a large quantity of data from a website and you are concerned about the volume of information and the time it would take, you need not worry: web scraping can help you greatly.
Web scraping lets you first retrieve a website's pages and then extract large amounts of data from them.
This is all done automatically, so you save valuable hours and a great deal of manual effort.
This tutorial will teach you how to automatically download a large number of files from the New York MTA website. The steps in this example are designed for newcomers to web scraping, so that by the time you are done with this tutorial you will have a basic knowledge of the web scraping process.
Turnstile data will be downloaded from: http://web.mta.info/developers/turnstile.html
You should also know that turnstile data is compiled weekly and has been since the second quarter of 2010.
Now you can imagine the huge number of .txt files currently on the website. The image below gives you an idea of how the data files are listed on the site. Each date shown in the image is a direct link to a .txt file that you can download.
Imagine manually trying to save each of the links to your personal desktop computer. You would have to highlight each of the links one-by-one, before you right-click and then save them straight to your PC.
The amount of effort and time it would take is something that is better left unimagined. However, with web scraping, you do not have to go through this strenuous exercise as you can save the links automatically and with much less stress.
But before you start web scraping, there are some things you really should know.
● The first thing you should know is how the data on the website may legally be used. To find out, read the website's T&Cs thoroughly. This is important because many websites explicitly prohibit the use of their data for commercial activity.
● Another thing you should know is to avoid requesting data too quickly. If you download data from the website at an overly fast rate, you could overload the site. In addition, you could be blocked from accessing the website.
Inspecting the Site for Data File Links
The web scraping process begins with inspecting the website. You want to know exactly where the data file links you intend to download are located on the site.
Web scraping requires a basic understanding of HTML, because you will be searching through multiple HTML tags to locate the code that holds the data you intend to extract.
While on the site, right-click anywhere on the page and then choose the "Inspect" option from the menu that appears.
By doing this, you will be able to view the HTML code behind the page.
After you click the "Inspect" option, a console like the one in the image below will appear.
You will also see an arrow icon (like the one in the image below) at the top left-hand corner of the pop-up console.
Whenever you click this arrow and then click any part of the website, the console displays the code for the part you clicked on.
For example, if you click on the topmost data file, which is labelled "Saturday, September 22, 2018", the data file link will be highlighted in blue on the pop-up console.
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
If you look closely at the file link above, you will see that each of the .txt files sits inside an <a> tag. The <a> tag is the HTML element that defines a hyperlink, and its href attribute holds the link's address.
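To make that structure concrete, here is a minimal sketch that pulls the href out of the anchor shown above using only Python's built-in html.parser module. The class name LinkCollector is made up for illustration; the rest of the tutorial uses BeautifulSoup for the same job.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


snippet = '<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>'

collector = LinkCollector()
collector.feed(snippet)
print(collector.hrefs)
```

The printed list contains the relative path to the .txt file, which is the piece of the tag you need for downloading.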
At this point, you know how to find data file links. Now it is time to move on to the code.
Using Python Code
You begin by importing four libraries:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
The next step is to request the website using the requests library. To do this, first set up the website URL and then fetch it, as follows:
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
If the request succeeds, the response output looks like this:
<Response [200]>
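If you would rather have the script stop with a clear error when the request fails, a slightly more defensive variant might look like the sketch below. The function name fetch_html, the 10-second timeout, and the optional session argument are assumptions added for illustration; they are not part of the original code.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the page's HTML text, raising an error on a failed request."""
    # Use the supplied session if there is one, else the requests module itself
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html('http://web.mta.info/developers/turnstile.html') would then return the HTML as a string that can be handed to the parser in the next step.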
You then need to parse the HTML using BeautifulSoup. BeautifulSoup turns the raw HTML into a nested data structure that is much easier to search.
Reading through the BeautifulSoup documentation will teach you a bit more about the library.
soup = BeautifulSoup(response.text, "html.parser")
To find every one of the <a> tags, use .findAll:
soup.findAll('a')
The code above returns every line of code that contains an <a> tag. Note that not every link matters for what you want to achieve. In this case, line 36 is where the required data begins.
In the image below, you can see the subset generated by BeautifulSoup from the call above.
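To see what that call returns without touching the live site, here is a small sketch that runs it on a made-up HTML snippet. The snippet is sample data invented for illustration, and the sketch uses find_all, which is the current name for the same method as findAll.

```python
from bs4 import BeautifulSoup

# Made-up sample page: one navigation link followed by two data file links
sample_html = """
<html><body>
  <a href="index.html">Home</a>
  <a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
  <a href="data/nyct/turnstile/turnstile_180915.txt">Saturday, September 15, 2018</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
a_tags = soup.find_all("a")

# Print each tag's position, link target, and visible text
for position, a_tag in enumerate(a_tags):
    print(position, a_tag.get("href"), a_tag.get_text(strip=True))
```

Printing the position next to each link is a quick way to work out where the data links start on a real page.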
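You now have to extract the required data file link.
For the first data link you have:
# Index 36 follows the note above about where the data links begin; adjust it if the page layout changes
one_a_tag = soup.findAll('a')[36]
link = one_a_tag['href']
The code above stores the following in your variable link:
'data/nyct/turnstile/turnstile_180922.txt'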
Note that 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180922.txt' is the complete URL for downloading the required data.
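One way to build that complete URL without hard-coding the prefix is to let the standard library resolve the relative link against the page address. This is a minimal sketch of an alternative, not the approach this tutorial takes, and the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "http://web.mta.info/developers/turnstile.html"
link = "data/nyct/turnstile/turnstile_180922.txt"

# urljoin resolves the relative link against the directory of page_url
download_url = urljoin(page_url, link)
print(download_url)
```

The advantage of urljoin is that it also copes with links that are already absolute or that start with a slash.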
To download the file straight to your computer, you should make use of the urllib.request library.
urllib.request.urlretrieve is given two parameters, namely:
- the file URL and
- the filename.
For the purpose of this tutorial, they can be named as follows for your data files:
“turnstile_172802” and so on.
download_url = 'http://web.mta.info/developers/' + link
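The urlretrieve call itself could then look something like the sketch below. The helper name download_file and the choice to name the local file after the last part of the URL are assumptions made for illustration.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def download_file(download_url, target_dir="."):
    """Save the file at download_url into target_dir and return its local path."""
    # Use the last path segment of the URL, e.g. turnstile_180922.txt, as the filename
    filename = os.path.basename(urlsplit(download_url).path)
    target_path = os.path.join(target_dir, filename)
    # urlretrieve takes the file URL and the filename to save it under
    urllib.request.urlretrieve(download_url, target_path)
    return target_path
```

With the download_url built above, download_file(download_url) would save turnstile_180922.txt into the current directory.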
Finally, you need to be careful not to spam the site. You can be flagged as a spammer if you make too many requests at once. To avoid this, add a line that pauses your code for a moment between requests, such as a call to time.sleep from Python's standard library.
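A minimal sketch of such a pause is shown here. The one-second value is an assumption; pick a delay that suits the site you are scraping.

```python
import time

PAUSE_SECONDS = 1  # assumed delay between requests, in seconds

# Suspend the script for the chosen time before making the next request
time.sleep(PAUSE_SECONDS)
```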
And that's it! Now that you know how to download a single data file, you can try using a for loop to download all of them. The image below shows the complete code needed to scrape the NY MTA turnstile data.
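For readers without the image, here is one way the pieces might fit together in a loop. It is a sketch under a few assumptions rather than the exact code from the image: the function names are made up, and the data links are picked out by their "turnstile_" filename and .txt ending instead of by a fixed position in the list.

```python
import time
import urllib.request
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://web.mta.info/developers/turnstile.html"


def find_data_links(html, page_url=PAGE_URL):
    """Return the complete URL of every turnstile .txt link in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"]
        # Assumption: data files are the links named turnstile_*.txt
        if "turnstile_" in href and href.endswith(".txt"):
            links.append(urljoin(page_url, href))
    return links


def download_all(page_url=PAGE_URL, pause_seconds=1, limit=None):
    """Download each data file into the current directory, pausing between files."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    links = find_data_links(response.text, page_url)[:limit]
    for download_url in links:
        filename = download_url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(download_url, filename)
        time.sleep(pause_seconds)  # pause so the site is not flooded with requests
    return links
```

Nothing is downloaded until download_all is called. A call such as download_all(limit=3) is a sensible first test, since it fetches only the first three files.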