How to Scrape HTML Tabular Data with Python
Many of you have probably read articles about scraping data from websites. Most of them suggest using Node.js with the Cheerio library or Python with Beautiful Soup. Although these tools are very effective once you master them, it takes time and effort to write all the code for finding the elements you need, requesting the data, and cleaning it into a dataframe before you can start the actual data analysis. (And, of course, some additional time to fix all the bugs and errors.)
This short article shows you the easiest way to scrape tabular data from a web page with three lines of Python!
Example of Scraping Real-time COVID-19 Data from Worldometer:
For example, suppose you want to get the tabular data from the Worldometer website. Because this dataset is dynamic and changes over time, scraping makes sense: we get the most up-to-date result every time we run the script!
To scrape this dataset, get your machine ready with Python and Pandas. We are going to use the Pandas read_html() function to extract all the tables from a web page. However, we cannot simply pass it the URL directly, because you might get a 403: Forbidden error. To avoid the error, we first request the page with the requests module to get the HTML body, and then let Pandas read it. Overall, the script looks like this:
import requests, pandas as pd
r = requests.get('http://www.worldometers.info/coronavirus/')
dfs = pd.read_html(r.text)
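If you want something a little more defensive than the three-liner, the same idea can be wrapped in two small functions. This is only a sketch under a few assumptions: the names parse_tables and fetch_tables, the User-Agent value, and the 10-second timeout are all made up for illustration, and the StringIO wrapper is there because recent pandas releases warn when read_html() is handed a raw HTML string.

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical header value; any browser-like User-Agent string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; table-scraper-example)"}


def parse_tables(html: str) -> list:
    """Parse every <table> found in an HTML string into a DataFrame."""
    # Wrapping the text in StringIO avoids the warning that newer pandas
    # versions emit when read_html() receives a literal HTML string.
    return pd.read_html(StringIO(html))


def fetch_tables(url: str, timeout: float = 10.0) -> list:
    """Download a page and return all of its tables as DataFrames."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 403, 404, or 5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_tables(response.text)
```

Calling fetch_tables('http://www.worldometers.info/coronavirus/') should then give the same kind of list as dfs, and a blocked request fails loudly instead of returning an empty or misleading result.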
The pandas.read_html() function searches the input you provide (here the HTML text of the page) for <table> and related tags. It always returns a list of DataFrames, even if the page has only one table.
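To see the "always a list" behaviour without hitting the network, here is a small sketch that feeds read_html() a hand-written HTML string. The two tiny tables are an assumption for illustration (the values are borrowed from the Worldometer output in this article), and the match argument is shown as one possible way to keep only the tables whose text contains a given string.

```python
from io import StringIO

import pandas as pd

# Hand-written page with two small tables, purely for illustration.
HTML = """
<table>
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>USA</td><td>1709243</td></tr>
  <tr><td>Brazil</td><td>376669</td></tr>
</table>
<table>
  <tr><th>Country</th><th>TotalDeaths</th></tr>
  <tr><td>USA</td><td>99883</td></tr>
</table>
"""

# Every <table> becomes one DataFrame in the returned list.
all_tables = pd.read_html(StringIO(HTML))
print(type(all_tables), len(all_tables))

# match= keeps only the tables whose text matches the string or regex.
# The result is still a list, even though only one table qualifies.
deaths_only = pd.read_html(StringIO(HTML), match="TotalDeaths")
print(len(deaths_only))
print(deaths_only[0])
```

Because the result is a list either way, you pick a table by position, for example all_tables[0], just like dfs[0] in the script above.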
dfs[0]
     #  Country,Other  TotalCases  NewCases  TotalDeaths  NewDeaths  TotalRecovered  ActiveCases  Serious,Critical  Tot Cases/1M pop  Deaths/1M pop  TotalTests  Tests/ 1M pop    Population
0  NaN          World     5622939   +38,672     348715.0     +1,102       2393539.0    2880685.0           53131.0             721.0           44.7         NaN            NaN           NaN
1  1.0            USA     1709243    +3,017      99883.0        +78        465668.0    1143692.0           17116.0            5167.0          302.0  15204572.0        45961.0  3.308117e+08
2  2.0         Brazil      376669       NaN      23522.0        NaN        153833.0     199314.0            8318.0            1773.0          111.0    735224.0         3461.0  2.124098e+08
3  3.0         Russia      362342    +8,915       3807.0       +174        131129.0     227406.0            2300.0            2483.0           26.0   9160590.0        62775.0  1.459285e+08
4  4.0          Spain      282480       NaN      26837.0        NaN        196958.0      58685.0             854.0            6042.0          574.0   3556567.0        76071.0  4.675305e+07
...
dfs[1]
     #  Country,Other  TotalCases  NewCases  TotalDeaths  NewDeaths  TotalRecovered  ActiveCases  Serious,Critical  Tot Cases/1M pop  Deaths/1M pop  TotalTests  Tests/ 1M pop    Population
0  NaN          World     5584267   +90,184     347613.0     +3,096       2362984.0    2873670.0           53167.0             716.0           44.6         NaN            NaN           NaN
1  1.0          China       82985       +11       4634.0        NaN         78268.0         83.0               7.0              58.0            3.0         NaN            NaN  1.439324e+09
2  2.0            USA     1706226   +19,790      99805.0       +505        464670.0    1141751.0           17114.0            5158.0          302.0  15187647.0        45910.0  3.308117e+08
3  3.0         Brazil      376669   +13,051      23522.0       +806        153833.0     199314.0            8318.0            1773.0          111.0    735224.0         3461.0  2.124098e+08
4  4.0         Russia      353427    +8,946       3633.0        +92        118798.0     230996.0            2300.0            2422.0           25.0   8945384.0        61300.0  1.459285e+08
...
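Notice that NewCases and NewDeaths come back as text such as +38,672 rather than numbers, so a little cleaning is still needed before analysis. The snippet below is one possible way to do it, not the only one. It works on a small sample copied from the dfs[0] output above, and the helper name to_number is made up for this example.

```python
import pandas as pd

# Small sample copied from the dfs[0] output above.
sample = pd.DataFrame({
    "Country,Other": ["World", "USA", "Brazil"],
    "NewCases": ["+38,672", "+3,017", None],
    "NewDeaths": ["+1,102", "+78", None],
})


def to_number(column: pd.Series) -> pd.Series:
    """Strip '+' and ',' from strings such as '+38,672' and convert to numbers."""
    cleaned = column.astype(str).str.replace(r"[+,]", "", regex=True)
    # errors="coerce" turns anything that is not a number (including the
    # missing values) into NaN instead of raising an exception.
    return pd.to_numeric(cleaned, errors="coerce")


for name in ["NewCases", "NewDeaths"]:
    sample[name] = to_number(sample[name])

print(sample.dtypes)
print(sample)
```

The same loop could be pointed at dfs[0] instead of sample, assuming the column names on the live page still match the ones shown here.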