Python and SQLite3
Introduction:
Any good data analyst would need to know how to run some basic EDA tasks. To do that most often it would require connecting and fetching data from various sources like web, text files, CSV files etc.
Data transformation, data wrangling using SQL is also very important for the Data Analyst. Today I am going to show how to connect to SQLite3 database in python and execute some basic queries.
Connecting to Database:
First we should connect to the database by importing sqlite3 library in python. Then we create a cursor(cur) object to work on this like fetching results etc. Here is the code snippet.
# Connect to Database and create cursor
Import sqlite3
conn = sqlite3.connect('data.sqlite')
cur = conn.cursor()
Querying Via the cursor:
Now that we have connected with the database, let’s take a look at how we can query the data. With the cursor object you can execute queries.
cur.execute("""SELECT * FROM employees LIMIT 5;""")
The execute command returns the cursor object. To see the results you must use the “fetchall” method.
cur.fetchall()
Alternatively we can also write the above two queries as one.
cur.execute("""SELECT * FROM employees LIMIT 5;""").fetchall()
Wrapping The Results In Panda DataFrame:
We can create a pandas dataframe using these results to be able to manipulate the data better. To do so you simply wrap the cur.fetchall() output with a pandas dataframe constructor.
import pandas as pd
cur.execute("""SELECT * FROM employees LIMIT 5;""")
df = pd.DataFrame(cur.fetchall())
df.head()
As you see there are no column names. We can access the column names by calling cur.description().
cur.execute("""SELECT * FROM employees LIMIT 5;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
Additional SQL Functions
Where clause: “Where” clause filters query results by some conditions.
#SQL Where Clause
cur.execute("""SELECT firstName,lastName,email,officeCode FROM employees WHERE officeCode='1';""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
Order By and Limit Clauses: The “Order By” clause allows you to sort the results by a particular feature. The “Limit” clause is typically the last argument in SQL query and simply limits the output to a set number of results.
#SQL Order By & Limit Clause
cur.execute("""SELECT employeeNumber, firstName, lastName, email, officeCode
FROM employees
ORDER BY employeeNumber ASC
LIMIT 15
;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
Group By Clause: The “Group By” clause groups records into summary rows and returns one record for each group. Group By is often used with aggregate functions like count, sum, avg etc., to group the result set by one or more columns.
#SQL Group By
cur.execute("""
SELECT lastName, firstName, email,officeCode,
COUNT(officeCode) as employeeCount
FROM employees GROUP BY officeCode;
""")df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
Having Clause: Having clause works similarly to where clause except that where clause cannot be used with aggregate functions. Having clause is used to filter data selections on conditions after the Group By clause.
#SQL Having Clause
cur.execute("""
SELECT lastName, firstName,
email,officeCode,COUNT(officeCode) as employeeCount
FROM employees GROUP BY officeCode
HAVING COUNT(officeCode)>=5;
""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
Conclusion:
This is a simple blog showing how to connect and query data from SQLite3 using python. In the next blog we will go a bit further into SQL like Joins, Procedures & CTEs.
Happy reading !!!
References