
Rather than learn Pandas syntax, which would take some time, I'd like to use my existing SQL skills to manipulate Pandas dataframes in Python.

My data are already in Pandas dataframes for various reasons.

If someone knows the answer, can you show me a simple example to get me started? Something like the following, I'm guessing:

import sqlalchemy

query = """
    select 
        trunc(date, 'MM') as month
        , avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
        , avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
    from pandas_df
    where date >= add_months(trunc(sysdate, 'DD'), -2)
    group by trunc(date, 'MM')
    order by trunc(date, 'MM');
"""

new_df = sqlalchemy.run_query(query)

Thanks! Sean

  • Can you explain what you mean by "manipulate"? You can query/select rows based on arbitrary conditions- pandas.pydata.org/pandas-docs/stable/generated/…. To take advantage of all that pandas has to offer, you'll need to learn pandas. Commented Jan 22, 2018 at 16:20
  • By "manipulate", I mean doing aggregations on groups, case statements on the selected columns, complex filters, subqueries, etc. I'm very skilled at these things in SQL, but learning them in Pandas has been slow. I will learn Pandas over time, but I was wondering if SQLAlchemy could do these things with existing Pandas dataframes. Thanks for the link--the Pandas documentation is very good. Commented Jan 22, 2018 at 18:14
  • 1
    The short answer is no.. pandas is not a query language, therefore the ability to write queries to complete the above mentioned tasks with pandas dataframes is very limited. If you don't need the true capabilities of pandas, you should load your data into tables and do what you need to do in SQL. Commented Jan 22, 2018 at 18:53
  • 1
    Look into pandasql allowing you to query pandas dataframes like db tables. Syntax uses SQLite dialect. Commented Jan 23, 2018 at 17:02

2 Answers


Meanwhile, consider learning pandas' conditional logic, filter, and aggregation methods, which translate directly from your SQL skills:

  1. trunc(..., 'MM') --> .dt.month
  2. CASE WHEN --> df.loc or np.where()
  3. WHERE --> []
  4. GROUP BY --> .groupby()
  5. AVG --> .agg('mean')

Here is an example:

Random Data (seeded for reproducibility)

import numpy as np
import pandas as pd
import datetime as dt
import time

epoch_time = int(time.time())

np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time)) for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})

# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)

print(pandas_df.head(10))

#                  date  hour_ending      volume
# 0 2017-01-01 20:05:19           10   56.660415
# 1 2017-09-02 00:56:27            3   79.060800
# 2 2018-01-04 09:25:05            7   23.076240
# 3 2016-11-27 23:44:55            6  102.801241
# 4 2017-01-29 12:19:55            5   88.824230
# 5 2017-04-15 15:16:09            6  214.168659
# 6 2017-09-07 08:12:45            9   97.607635
# 7 2017-12-31 15:35:36           13  141.467249
# 8 2017-04-21 23:01:44           13  156.246854
# 9 2016-12-22 09:27:49            2   67.646662

Aggregation (calculates columns prior to summarizing)

# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month

# CONDITIONAL LOGIC COLUMNS
# NOTE: pandas >= 1.3 expects inclusive='neither' rather than inclusive=False
pandas_df.loc[pandas_df['hour_ending'].between(7, 24, inclusive=False), 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~pandas_df['hour_ending'].between(7, 24, inclusive=False), 'volume_off_peak'] = pandas_df['volume']

print(pandas_df.head(10))
#                  date  hour_ending      volume  month  volume_peak  volume_off_peak
# 0 2017-01-01 20:05:19           10   56.660415      1    56.660415              NaN
# 1 2017-09-02 00:56:27            3   79.060800      9          NaN        79.060800
# 2 2018-01-04 09:25:05            7   23.076240      1          NaN        23.076240
# 3 2016-11-27 23:44:55            6  102.801241     11          NaN       102.801241
# 4 2017-01-29 12:19:55            5   88.824230      1          NaN        88.824230
# 5 2017-04-15 15:16:09            6  214.168659      4          NaN       214.168659
# 6 2017-09-07 08:12:45            9   97.607635      9    97.607635              NaN
# 7 2017-12-31 15:35:36           13  141.467249     12   141.467249              NaN
# 8 2017-04-21 23:01:44           13  156.246854      4   156.246854              NaN
# 9 2016-12-22 09:27:49            2   67.646662     12          NaN        67.646662

# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
                   .groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')

print(agg_df)
#        volume_peak  volume_off_peak
# month                              
# 1        62.597999        23.076240
# 11       37.775000        17.075594
# 12      141.063694        29.986261
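The CASE WHEN mapping (item 2 above) can also be written with np.where(), which reads closer to a single SQL expression than the two .loc assignments used here. A minimal sketch on a toy frame (not the seeded data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hour_ending': [3, 8, 15, 24],
                   'volume': [10.0, 20.0, 30.0, 40.0]})

# CASE WHEN hour_ending > 7 AND hour_ending < 24 THEN volume ELSE NULL END
peak_mask = (df['hour_ending'] > 7) & (df['hour_ending'] < 24)
df['volume_peak'] = np.where(peak_mask, df['volume'], np.nan)
df['volume_off_peak'] = np.where(~peak_mask, df['volume'], np.nan)

# AVG ignores NULLs in SQL; pandas mean() likewise skips NaN by default
print(df[['volume_peak', 'volume_off_peak']].mean())
```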

SQL (using MS Access, which imported the CSV produced by the code above)

SELECT Month(p.date) As month, 
       AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
       AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.Date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)

-- month         volume_peak     volume_off_peak
--     1    62.5979990645683    23.0762401295465
--    11    37.7750002748325    17.0755937444385
--    12    141.063693957234    29.9862605960166


The short answer is no, there is no direct way that I am aware of to manipulate a pandas DataFrame using SQLAlchemy.

If you really want to use SQL, I think the easiest approach would be to write a couple of helper functions that convert a pandas DataFrame to and from a SQL table using a database driver like SQLite.
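A minimal sketch of such a helper, assuming the standard-library sqlite3 driver and pandas' to_sql/read_sql_query (the function name query_df and the table name df are illustrative, not from any library):

```python
import sqlite3
import pandas as pd

def query_df(df, sql, table_name='df'):
    """Load a DataFrame into a throwaway in-memory SQLite table and run SQL against it."""
    with sqlite3.connect(':memory:') as conn:
        df.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(sql, conn)

# Toy data standing in for the question's hourly volumes
pandas_df = pd.DataFrame({'hour_ending': [3, 8, 15, 24],
                          'volume': [10.0, 20.0, 30.0, 40.0]})

result = query_df(pandas_df, """
    select
        avg(case when hour_ending > 7 and hour_ending < 24 then volume end) as volume_peak,
        avg(case when hour_ending <= 7 or hour_ending = 24 then volume end) as volume_off_peak
    from df
""")
```

The cost is a full copy of the data into SQLite on every call, so this is only sensible for moderately sized frames.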

However, as a developer well-versed in SQL who recently had to deal with some data in a pandas DataFrame, I can strongly recommend just learning the pandas way of doing things. It's not as declarative as SQL - often what would be a single query in SQL has to be broken up into a few steps in pandas. However, I found that working with pandas gets easier quickly, and the learning process was beneficial and satisfying.
