Web Scraping using Beautiful Soup and Pandas

Sailaja Karra
2 min read · Feb 25, 2020


(Cover image from Skill Share)

Today we are going to look at how to do web scraping using both Beautiful Soup and pandas. Most of us use both the Beautiful Soup and pandas libraries, but we rarely seem to use them together. I would like to show how powerful and seamless the whole web-scraping process can be when we combine the two.

We have already seen a few examples of how to make a request with Python’s requests library. The result is a response object with (hopefully) a status code of 200, which means we were able to connect to the website and get the data back.

# code snippet to get population data from Wikipedia
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
res = requests.get(url)
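Before parsing anything, it is worth confirming that the request actually succeeded. Here is a minimal check, assuming the res object from the snippet above:

# Stop early if the request did not come back with a 200 OK
if res.status_code != 200:
    raise RuntimeError(f'Request failed with status code {res.status_code}')

requests also provides res.raise_for_status(), which does the same thing for any 4xx/5xx response.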

Little Nugget: If you are behind a firewall and need to set your proxies, here is a small code sample for that.

# https://requests.readthedocs.io/en/latest/user/advanced/#proxies
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)
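If you would rather not hard-code the proxies in every call, requests also honours the standard HTTP_PROXY and HTTPS_PROXY environment variables. A small sketch, assuming the same proxy addresses as above:

import os
import requests

# requests picks these up automatically for every call in this process
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:1080'

requests.get('http://example.org')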

Once we have the response object, we can use Beautiful Soup to parse the HTML and extract the tag data. This is super useful, as you can select any particular element and pull out the details you need.

soup = BeautifulSoup(res.text,'html.parser')
table = soup.find('table')
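Keep in mind that soup.find('table') returns only the first table on the page. If the page contains several tables, you can narrow the search by class name; the class used below ('wikitable') is an assumption about how this particular Wikipedia page is marked up, so inspect the HTML first:

# Grab every table, or target one by its CSS class
tables = soup.find_all('table')
table = soup.find('table', {'class': 'wikitable'})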

The interesting part I would like to talk about today is how we can use pandas to get refined data right away after the steps above. pandas has a function called pd.read_html(), which reads the HTML table we extracted with Beautiful Soup and converts it into a nice DataFrame for you.

population_df = pd.read_html(str(table))[0]
population_df.head()
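Two small notes on pd.read_html(): it always returns a list of DataFrames (hence the [0]), and newer pandas versions warn about passing a literal HTML string, preferring a file-like object instead. A sketch of the more future-proof call, plus the shortcut of pointing read_html() straight at the URL (which parses every table on the page, provided a parser such as lxml is installed):

from io import StringIO

# Wrap the HTML string to avoid the deprecation warning in recent pandas
population_df = pd.read_html(StringIO(str(table)))[0]

# Alternatively, let pandas fetch and parse the page itself
all_tables = pd.read_html(url)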

As you can see, the whole process is quite easy and effortless.
Here is the full code.

import pandas as pd
from bs4 import BeautifulSoup
import requests

# Fetch the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
res = requests.get(url)

# Parse the HTML and grab the first table on the page
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find('table')

# Convert the table into a DataFrame and preview it
population_df = pd.read_html(str(table))[0]
population_df.head()
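From here the DataFrame behaves like any other, so you can clean it up, analyse it, or save it for later. For example (the CSV file name is just an illustration):

# Check what columns the table actually produced, then persist it
print(population_df.columns)
population_df.to_csv('world_population.csv', index=False)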

Thanks for reading!!!
