How to use Pandas to access databases

and is that the right thing to do

Irina Truong
j-bennet codes

--

Pandas is a great tool to explore data stored in files (comma-delimited, tab-delimited, Parquet, HDF5, etc.). In fact, most tutorials that you’ll find on Pandas will start with reading some kind of sample file (or files), most likely using .read_csv:

https://gist.github.com/j-bennet/ce2f232ef8eda4cff95280d2ab7be19b
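As a minimal sketch (the file name here is made up), that kind of snippet usually looks something like this:

    import pandas as pd

    # Read a comma-delimited file into a dataframe (the file name is hypothetical).
    df = pd.read_csv("payments.csv")
    print(df.head())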

Sometimes, the data that you want to analyze is stored in a different kind of storage, for example, a relational database. It is possible to load this data into Pandas dataframes with an SQLAlchemy connection (or a DBAPI2 connection for SQLite):

https://gist.github.com/j-bennet/ea28d615a831dc568a806d8b1b1dcbc6
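A minimal sketch of that approach, assuming a MySQL sakila database and placeholder credentials, might look like this:

    import pandas as pd
    from sqlalchemy import create_engine

    # The connection string is an assumption: adjust the driver, credentials and host.
    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Load a whole table into a dataframe.
    payments = pd.read_sql_table("payment", engine)
    print(payments.head())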

The code is very simple, and it looks nice and easy. Load all the tables into dataframes, and do the analysis on them in Pandas.

But should you?

As often happens, the answer is not black-and-white. It depends, mostly, on the size of your data. Why does it matter? Remember that with Pandas, things are not lazy. A Pandas dataframe is a structure in memory. If your table has lots of fields and millions of records, and you try loading the whole thing into memory, you might just crash your computer, or at the very least, get an OOM (Out of Memory) exception. And loading multiple large tables? Not going to happen.

So what are the options then? When the data is stored in a database, start exploring it from there. A database is storage that is highly optimized for querying, and it would be wrong not to take advantage of that.

Explore the database using a CLI

A lot of useful things about the data can be found out using SQL queries and your favorite database client. Here are some example queries (using the sakila database in MySQL and the mycli client):

https://gist.github.com/j-bennet/48c162a8f2d20f7718acb7bd5d1180af

Note that since these queries only return one record, or a few records at most, they can be safely issued via Pandas as well:

https://gist.github.com/j-bennet/f067ef367d846e7dd323fcfe3e6fe339
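As a hedged sketch, small exploration queries issued through Pandas might look like this (the table and column names assume the sakila schema):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # How many payment records are there?
    print(pd.read_sql_query("SELECT COUNT(*) AS cnt FROM payment", engine))

    # What date range do the payments cover?
    print(pd.read_sql_query(
        "SELECT MIN(payment_date) AS first_payment, MAX(payment_date) AS last_payment FROM payment",
        engine,
    ))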

On the other hand, you’d lose SQL autocompletion, syntax highlighting, and perhaps other features that your database client provides.

For the queries above, it comes down to preference: whether you use Pandas to query the database or the database client application. This is not the case with queries that retrieve a lot of data. When using Pandas, it makes sense to minimize the amount of data you load into memory. There are a few strategies for that:

Limit the fields to retrieve

Hopefully, when doing the initial exploration of the data (as shown above), you zeroed in on the subset of tables and fields you’re interested in. So query for only those fields and tables when loading your Pandas dataframes, with pd.read_sql_query:

https://gist.github.com/j-bennet/e85f7694cb4ec621f06c59a84c9b81da
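For example, a sketch that pulls only a few columns of the sakila payment table (the column choice here is an assumption):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Only pull the columns we actually need, instead of SELECT *.
    query = "SELECT customer_id, amount, payment_date FROM payment"
    payments = pd.read_sql_query(query, engine)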

Limit the records to retrieve

This means either sampling the records (a LIMIT clause might do that), or only retrieving records that fit specific criteria (if you only want to analyze payments in the current year, it does not make sense to load the whole table, so you’re going to need a WHERE condition in that SQL query):

https://gist.github.com/j-bennet/2aefe79c0af40f870d6bb2703f94abca
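A sketch, with an arbitrary date cutoff and LIMIT value:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Only payments from 2006 onward, and only a sample of them.
    query = """
        SELECT customer_id, amount, payment_date
        FROM payment
        WHERE payment_date >= '2006-01-01'
        LIMIT 10000
    """
    payments = pd.read_sql_query(query, engine)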

Let the database server handle joins

Any time you think of retrieving records from tables into Pandas dataframes with the purpose of later joining those dataframes, it’s probably not the best idea. Databases are highly optimized for joins, and it makes much more sense to offload this task to the database server. Besides, one dataframe in memory (the result of the JOIN) is better than two:

https://gist.github.com/j-bennet/ca2a85974c898707c02e3155308e0348
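A sketch of pushing the join to MySQL and loading only the joined result (column names again assume the sakila schema):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Join payment to customer on the server; only the joined result comes back.
    query = """
        SELECT c.first_name, c.last_name, p.amount, p.payment_date
        FROM payment p
        JOIN customer c ON p.customer_id = c.customer_id
    """
    payments_with_customers = pd.read_sql_query(query, engine)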

Estimate memory usage

It is very easy to estimate memory usage in Pandas, with the .memory_usage call on a dataframe:

https://gist.github.com/j-bennet/546a4c20fc9b178155acdc4aa6a158ab

If a sample of 100 records takes up 5,728 bytes, then 1,000,000 records will take up approximately 57,280,000 bytes, or about 55 MB. This is very inexact, but at least it gives you an idea.
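For example, a sketch that loads a small sample, sums up its memory usage (deep=True also counts object columns such as strings), and extrapolates to the full table (the table name is again an assumption):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Load a small sample and see how much memory it takes.
    sample = pd.read_sql_query("SELECT * FROM payment LIMIT 100", engine)
    sample_bytes = sample.memory_usage(deep=True).sum()

    # Extrapolate (very roughly) to the full table.
    row_count = pd.read_sql_query("SELECT COUNT(*) AS cnt FROM payment", engine)["cnt"].iloc[0]
    print(sample_bytes / len(sample) * row_count)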

Reduce memory usage with data types

Sometimes, it is possible to use more memory-efficient data types for the dataframe fields. By default, Pandas will read all integer data types from the database as int64, even though they might have been defined as smaller data types in the database. For example, let’s look at this table:

https://gist.github.com/j-bennet/f3a9311aaceba79616bd0753451d7b1e

When loaded with Pandas, the data types are not equivalent to those in the database:

https://gist.github.com/j-bennet/ebdbcaf1fcca2f1957780ac5321ef247
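A sketch of checking this, assuming the table from the gist resembles sakila’s payment table:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    df = pd.read_sql_query("SELECT * FROM payment LIMIT 100", engine)

    # Integer columns come back as int64, regardless of the database column types.
    print(df.dtypes)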

The database smallint and tinyint types got converted into int64, which means using more memory than we have to. Can we do better?

Here is one way to optimize things (a combined code sketch follows the steps below):

  • Read the table (or query) in chunks, providing the chunksize parameter.
https://gist.github.com/j-bennet/ce556e9faed37c48336b34d442297b63

Since we assume the table is large, we can’t load the whole result into memory at once.

  • Convert the data types of each chunk to smaller data types.
https://gist.github.com/j-bennet/a9a4971c7d48ee81341be4981e826504

Here, we know which fields can be converted into smaller types, because we saw the table definition in the database.

  • Concatenate the updated chunks into a new dataframe.
https://gist.github.com/j-bennet/3225de6081e210f56dd5bde21f07507e
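Putting the three steps together, a combined sketch might look like the following. The chunk size and the target data types are assumptions based on a payment-like table definition, and the chosen types must actually fit the values in those columns:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:password@localhost/sakila")

    # Columns we believe (from the table definition) fit into smaller integer types.
    smaller_types = {
        "payment_id": "int32",
        "customer_id": "int16",
        "staff_id": "int8",
    }

    chunks = []
    # Step 1: read the query result in chunks instead of all at once.
    for chunk in pd.read_sql_query("SELECT * FROM payment", engine, chunksize=10000):
        # Step 2: downcast each chunk to the smaller types.
        chunks.append(chunk.astype(smaller_types))

    # Step 3: concatenate the converted chunks into one dataframe.
    payments = pd.concat(chunks, ignore_index=True)
    print(payments.memory_usage(deep=True).sum())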

Not bad: we now have 4,500 bytes instead of 5,728 bytes, a 21% reduction in memory. This excellent IPython notebook shows how to optimize data types even further, and provides a generic function to do that.

Summary

To reiterate the important points of using Pandas to explore a database:

  • Dataframes in Pandas are not lazy; they are loaded into memory, so be aware of the memory usage.
  • Start exploring with a SQL client to determine the size and shape of the data.
  • Based on the size of the data, either load whole tables into Pandas, or query for only selected fields and possibly limit the result to a sample of records.
  • Let the database do the joins; it’s good at them.
  • Estimate the size of the data. If necessary, use memory-efficient data types.

Happy exploring!

Special thanks to Sheila Tüpker, who inspired this article by asking a very good question.
