
I'm facing efficiency issues while uploading data to an AWS RDS (MySQL) database. Every day I process a CSV file with about 6,000 lines using Python's 'pymysql' package, reading each line and selectively uploading values based on column indices. Although each iteration updates only ten values, the entire run takes over 10 hours to complete.

Could anyone suggest ways to optimize this process? My current code is attached below for reference:

import pandas as pd
import pymysql
from datetime import datetime
import time


# Connect to the database and select the schema
conn = pymysql.connect(host='hostURL', user='user', password='pw', database='DB')
cursor = conn.cursor()

# Read the day's predictions
date_org = datetime(2023, 11, 6).strftime("%Y%m%d")
path = 'C://local_code_run//data//'
resPath = path + 'RandomForest_output//' + '20231106_preds.csv'
data = pd.read_csv(resPath)
data = data.where(pd.notnull(data), None)  # replace missing values so they upload as SQL NULLs

# Fill remaining NaNs in this prediction column with 0 before uploading
data['Individual_Station_Model_Pred'] = data['Individual_Station_Model_Pred'].fillna(0)

# SQL
sql = '''
    UPDATE PM25_Predictions
    SET Lucas_ML_All = %s, Lucas_ML_One = %s
    WHERE stationid = %s
      AND YEAR(UTC) = %s AND MONTH(UTC) = %s AND DAY(UTC) = %s
      AND YEAR(Forecast) = %s AND MONTH(Forecast) = %s AND DAY(Forecast) = %s AND HOUR(Forecast) = %s
'''

# Upload row by row
utc_year, utc_month, utc_day = date_org[:4], date_org[4:6], date_org[6:8]

for row in data.itertuples():
    forecast_year = str(row.UTC_DATE)[:4]
    forecast_month = str(row.UTC_DATE)[4:6]
    forecast_day = str(row.UTC_DATE)[6:8]
    forecast_hour = str(row.UTC_TIME)[:-2]  # strip the last two digits, leaving the hour
    params = (
        row.All_Station_Model_Pred,
        row.Individual_Station_Model_Pred,
        row.Station,
        utc_year, utc_month, utc_day,
        forecast_year, forecast_month, forecast_day, forecast_hour,
    )

    try:
        cursor.execute(sql, params)
    except pymysql.MySQLError as e:
        print("Error while updating record:", e)
        conn.rollback()  # Roll back the failed statement
    else:
        conn.commit()  # Commit after every single row

Thanks in advance for your help!

I also tried compiling the data into a table and uploading it all at once, but that was similarly slow (a simplified sketch of that attempt is below). I'd appreciate hearing more efficient approaches from anyone who has dealt with a similar issue.
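The sketch reuses data and date_org from the script above; the staging-table name PM25_Staging, its column layout, and the exact shape of the UPDATE ... JOIN are illustrative rather than my real code:

# Sketch: stage all rows with executemany, then apply one set-based UPDATE.
cursor.execute('''
    CREATE TEMPORARY TABLE IF NOT EXISTS PM25_Staging (
        stationid     VARCHAR(32),
        forecast_ymd  CHAR(8),
        forecast_hour TINYINT,
        pred_all      DOUBLE,
        pred_one      DOUBLE
    )
''')

rows = [
    (row.Station, str(row.UTC_DATE), str(row.UTC_TIME)[:-2],
     row.All_Station_Model_Pred, row.Individual_Station_Model_Pred)
    for row in data.itertuples()
]
# executemany rewrites INSERT ... VALUES into a multi-row statement,
# so this is one round trip per batch instead of one per row
cursor.executemany(
    'INSERT INTO PM25_Staging VALUES (%s, %s, %s, %s, %s)', rows
)

# One UPDATE ... JOIN instead of ~6,000 single-row statements
# (%% because pymysql interpolates %s parameters into this string)
cursor.execute('''
    UPDATE PM25_Predictions p
    JOIN PM25_Staging s
      ON p.stationid = s.stationid
     AND DATE(p.Forecast) = STR_TO_DATE(s.forecast_ymd, '%%Y%%m%%d')
     AND HOUR(p.Forecast) = s.forecast_hour
    SET p.Lucas_ML_All = s.pred_all,
        p.Lucas_ML_One = s.pred_one
    WHERE DATE(p.UTC) = STR_TO_DATE(%s, '%%Y%%m%%d')
''', (date_org,))
conn.commit()

Even in this form, the DATE()/HOUR() calls on the left-hand side of the join conditions prevent MySQL from using an index on Forecast; storing real DATETIME values in the staging table, or comparing against half-open ranges on the raw columns, would presumably be needed to fix that.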

  • I don't usually work with MySQL (I usually use PostgreSQL), so I'm not sure whether this applies, but I'd recommend trying cur.executemany to see if it yields a favorable runtime. It may also be that there is no index on stationid, year, month, day, etc. in your database; adding those indexes would help with row lookup time (and therefore row update time) as well. Commented Nov 10, 2023 at 15:37
  • Can you please query SHOW CREATE TABLE PM25_Predictions and include the result in your question above, so we don't have to guess at your column data types or indexes? Please use text, not a screenshot. Commented Nov 10, 2023 at 15:50
  • Also run SELECT VERSION(); and include that information. Commented Nov 10, 2023 at 15:50
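Edit: based on the executemany and index suggestions in the comments, here is a sketch of what I plan to try. The index name idx_station_utc_forecast and its columns are my own placeholders until I can post the SHOW CREATE TABLE output:

# Batch the existing UPDATE and commit once, instead of once per row.
# (pymysql only rewrites INSERT ... VALUES into a true multi-row statement;
# for UPDATE it still sends one statement per tuple, so the saving here
# is mainly the single commit instead of ~6,000 of them.)
params_list = []
for row in data.itertuples():
    params_list.append((
        row.All_Station_Model_Pred,
        row.Individual_Station_Model_Pred,
        row.Station,
        date_org[:4], date_org[4:6], date_org[6:8],
        str(row.UTC_DATE)[:4], str(row.UTC_DATE)[4:6],
        str(row.UTC_DATE)[6:8], str(row.UTC_TIME)[:-2],
    ))

try:
    cursor.executemany(sql, params_list)
except pymysql.MySQLError as e:
    print("Error while updating batch:", e)
    conn.rollback()
else:
    conn.commit()  # one commit for the whole batch

# One-time index so each UPDATE can seek instead of scanning the table
# (placeholder name/columns -- adjust to the real schema):
cursor.execute('''
    CREATE INDEX idx_station_utc_forecast
        ON PM25_Predictions (stationid, UTC, Forecast)
''')

As I understand it, the YEAR(UTC) = %s style predicates would still keep MySQL from using the UTC and Forecast parts of such an index, so rewriting the WHERE clause as range comparisons on the raw columns (e.g. UTC >= %s AND UTC < %s) is probably also needed to get the full benefit.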
