
Rather than learn Pandas syntax, which would take some time, I'd like to use my existing SQL skills to manipulate Pandas dataframes in Python.

My data are already in Pandas dataframes for various reasons.

If someone knows the answer, can you show me a simple example to get me started? Something like the following, I'm guessing:

import sqlalchemy

query = """
    select 
        trunc(date, 'MM') as month
        , avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
        , avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
    from pandas_df
    where date >= add_months(trunc(sysdate, 'DD'), -2)
    group by trunc(date, 'MM')
    order by trunc(date, 'MM');
"""

new_df = sqlalchemy.run_query(query)

Thanks! Sean

  • Can you explain what you mean by "manipulate"? You can query/select rows based on arbitrary conditions- pandas.pydata.org/pandas-docs/stable/generated/…. To take advantage of all that pandas has to offer, you'll need to learn pandas. Commented Jan 22, 2018 at 16:20
  • By "manipulate", I mean doing aggregations on groups, case statements on the selected columns, complex filters, subqueries, etc. I'm very skilled at these things in SQL, but learning them in Pandas has been slow. I will learn Pandas over time, but I was wondering if SQLAlchemy could do these things with existing Pandas dataframes. Thanks for the link--the Pandas documentation is very good. Commented Jan 22, 2018 at 18:14
  • 1
    The short answer is no.. pandas is not a query language, therefore the ability to write queries to complete the above mentioned tasks with pandas dataframes is very limited. If you don't need the true capabilities of pandas, you should load your data into tables and do what you need to do in SQL. Commented Jan 22, 2018 at 18:53
  • 1
    Look into pandasql allowing you to query pandas dataframes like db tables. Syntax uses SQLite dialect. Commented Jan 23, 2018 at 17:02

2 Answers


Meanwhile, consider learning pandas' conditional logic, filter, and aggregation methods, which translate directly from your SQL skills:

  1. trunc(..., 'MM') --> .dt.month
  2. CASE WHEN --> df.loc or np.where()
  3. WHERE --> []
  4. GROUP BY --> .groupby()
  5. AVG --> .agg('mean')

Here is an example:

Random Data (seeded for reproducibility)

import numpy as np
import pandas as pd
import datetime as dt
import time

epoch_time = int(time.time())

np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time)) for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})

# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)

print(pandas_df.head(10))

#                  date  hour_ending      volume
# 0 2017-01-01 20:05:19           10   56.660415
# 1 2017-09-02 00:56:27            3   79.060800
# 2 2018-01-04 09:25:05            7   23.076240
# 3 2016-11-27 23:44:55            6  102.801241
# 4 2017-01-29 12:19:55            5   88.824230
# 5 2017-04-15 15:16:09            6  214.168659
# 6 2017-09-07 08:12:45            9   97.607635
# 7 2017-12-31 15:35:36           13  141.467249
# 8 2017-04-21 23:01:44           13  156.246854
# 9 2016-12-22 09:27:49            2   67.646662

Aggregation (calculates columns prior to summarizing)

# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month

# CONDITIONAL LOGIC COLUMNS
# NOTE: pandas >= 1.3 expects inclusive='neither' rather than inclusive=False
pandas_df.loc[pandas_df['hour_ending'].between(7, 24, inclusive=False), 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~pandas_df['hour_ending'].between(7, 24, inclusive=False), 'volume_off_peak'] = pandas_df['volume']

print(pandas_df.head(10))
#                  date  hour_ending      volume  month  volume_peak  volume_off_peak
# 0 2017-01-01 20:05:19           10   56.660415      1    56.660415              NaN
# 1 2017-09-02 00:56:27            3   79.060800      9          NaN        79.060800
# 2 2018-01-04 09:25:05            7   23.076240      1          NaN        23.076240
# 3 2016-11-27 23:44:55            6  102.801241     11          NaN       102.801241
# 4 2017-01-29 12:19:55            5   88.824230      1          NaN        88.824230
# 5 2017-04-15 15:16:09            6  214.168659      4          NaN       214.168659
# 6 2017-09-07 08:12:45            9   97.607635      9    97.607635              NaN
# 7 2017-12-31 15:35:36           13  141.467249     12   141.467249              NaN
# 8 2017-04-21 23:01:44           13  156.246854      4   156.246854              NaN
# 9 2016-12-22 09:27:49            2   67.646662     12          NaN        67.646662

# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
                   .groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')

print(agg_df)
#        volume_peak  volume_off_peak
# month                              
# 1        62.597999        23.076240
# 11       37.775000        17.075594
# 12      141.063694        29.986261
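The CASE WHEN mapping (item 2 above) can also be written with np.where(), which reads closer to a single SQL expression than the two .loc assignments used here. A minimal sketch on a toy frame (not the seeded data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hour_ending': [3, 8, 15, 24],
                   'volume': [10.0, 20.0, 30.0, 40.0]})

# CASE WHEN hour_ending > 7 AND hour_ending < 24 THEN volume ELSE NULL END
peak_mask = (df['hour_ending'] > 7) & (df['hour_ending'] < 24)
df['volume_peak'] = np.where(peak_mask, df['volume'], np.nan)
df['volume_off_peak'] = np.where(~peak_mask, df['volume'], np.nan)

# AVG ignores NULLs in SQL; pandas mean() likewise skips NaN by default
print(df[['volume_peak', 'volume_off_peak']].mean())
```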

SQL (using MS Access, which imported the CSV produced by the code above)

SELECT Month(p.date) As month, 
       AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
       AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.Date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)

-- month         volume_peak     volume_off_peak
--     1    62.5979990645683    23.0762401295465
--    11    37.7750002748325    17.0755937444385
--    12    141.063693957234    29.9862605960166


The short answer is no, there is no direct way that I am aware of to manipulate a pandas DataFrame using SQLAlchemy.

If you really want to use SQL, I think the easiest approach would be to write a couple of helper functions that convert a pandas DataFrame to and from a SQL table using a database driver like SQLite.
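A minimal sketch of such a helper, assuming the standard-library sqlite3 driver and pandas' to_sql/read_sql_query (the function name query_df and the table name df are illustrative, not from any library):

```python
import sqlite3
import pandas as pd

def query_df(df, sql, table_name='df'):
    """Load a DataFrame into a throwaway in-memory SQLite table and run SQL against it."""
    with sqlite3.connect(':memory:') as conn:
        df.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(sql, conn)

# Toy data standing in for the question's hourly volumes
pandas_df = pd.DataFrame({'hour_ending': [3, 8, 15, 24],
                          'volume': [10.0, 20.0, 30.0, 40.0]})

result = query_df(pandas_df, """
    select
        avg(case when hour_ending > 7 and hour_ending < 24 then volume end) as volume_peak,
        avg(case when hour_ending <= 7 or hour_ending = 24 then volume end) as volume_off_peak
    from df
""")
```

The cost is a full copy of the data into SQLite on every call, so this is only sensible for moderately sized frames.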

However, as a developer well-versed in SQL who recently had to deal with some data in a pandas DataFrame, I can strongly recommend just learning the pandas way of doing things. It's not as declarative as SQL - often what would be a single query in SQL has to be broken up into a few steps in pandas. However, I found that working with pandas gets easier quickly, and the learning process was beneficial and satisfying.
