I'm facing efficiency issues while uploading data to an AWS RDS database. Every day I process a CSV file of about 6,000 lines using Python's 'pymysql' package, reading each line and selectively uploading values based on column indices. Each iteration updates only about ten values, yet the entire operation takes over 10 hours to complete.
Could anyone suggest ways to optimize this process? My current code is attached below for reference:
import pandas as pd
import pymysql
from datetime import datetime
# Connect to the database
conn = pymysql.connect(host='hostURL',user='user', password='pw')
cursor = conn.cursor()
cursor.execute("USE DB")
# Read your data
date_org = datetime(2023,11,6).strftime("%Y%m%d")
path = 'C:/local_code_run/data/'
resPath = path + 'RandomForest_output/' + '20231106_preds.csv'
data = pd.read_csv(resPath)
data = data.where(pd.notnull(data), None)
# For this column, fill missing values with 0 instead of None
data['Individual_Station_Model_Pred'] = data['Individual_Station_Model_Pred'].fillna(0)
# SQL
sql = '''
UPDATE PM25_Predictions
SET Lucas_ML_All = %s, Lucas_ML_One = %s
WHERE stationid = %s
AND YEAR(UTC) = %s AND MONTH(UTC) = %s AND DAY(UTC) = %s
AND YEAR(Forecast) = %s AND MONTH(Forecast) = %s AND DAY(Forecast) = %s AND HOUR(Forecast) = %s
'''
# Upload
for row in data.itertuples():
    # The UTC date parts come from the fixed run date; the forecast
    # parts come from each row's UTC_DATE and UTC_TIME fields.
    utc_year, utc_month, utc_day = date_org[:4], date_org[4:6], date_org[6:8]
    forecast_year = str(row.UTC_DATE)[:4]
    forecast_month = str(row.UTC_DATE)[4:6]
    forecast_day = str(row.UTC_DATE)[6:8]
    forecast_hour = str(row.UTC_TIME)[:-2]
    params = (
        row.All_Station_Model_Pred,
        row.Individual_Station_Model_Pred,
        row.Station,
        utc_year, utc_month, utc_day,
        forecast_year, forecast_month, forecast_day, forecast_hour,
    )
    try:
        cursor.execute(sql, params)
    except pymysql.MySQLError as e:
        print("Error while updating record:", e)
        conn.rollback()  # Roll back in case of error
    else:
        conn.commit()  # Commit after each individual row
Thanks in advance for your help!
I also attempted to compile the data into a table and upload it all at once, but that approach was similarly slow. I would therefore appreciate hearing about more efficient approaches from anyone who has dealt with similar issues.
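For reference, a minimal sketch of what such a batched upload could look like with pymysql's cursor.executemany (also suggested in the comments below). It reuses data, date_org, conn, cursor, and sql from the code above; the batch-building details are an assumption, not the exact code that was tried:

# Sketch: build every parameter tuple first, then send them all in one
# executemany() call and commit once at the end.
utc_year, utc_month, utc_day = date_org[:4], date_org[4:6], date_org[6:8]

params_list = []
for row in data.itertuples():
    params_list.append((
        row.All_Station_Model_Pred,
        row.Individual_Station_Model_Pred,
        row.Station,
        utc_year, utc_month, utc_day,
        str(row.UTC_DATE)[:4], str(row.UTC_DATE)[4:6], str(row.UTC_DATE)[6:8],
        str(row.UTC_TIME)[:-2],
    ))

try:
    cursor.executemany(sql, params_list)  # send all rows in one call
    conn.commit()                         # one commit instead of ~6,000
except pymysql.MySQLError as e:
    conn.rollback()
    print("Error while updating batch:", e)

One caveat: for UPDATE statements, pymysql's executemany still executes one statement per row internally (only INSERT ... VALUES queries are rewritten into a single multi-row statement), so the larger saving here is likely the single commit, since the original loop pays a full transaction commit for every row.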
From the comments: try cur.executemany to see if that yields a favorable runtime optimization. It might also be that there is no index on stationid, year, month, day, etc. in your database; adding those indices will help with the row lookup time (and therefore the row update time) as well. Could you run SHOW CREATE TABLE PM25_Predictions and include the result in your question above, so we don't have to guess at your column data types or indexes? Please use text, not a screenshot. Also run SELECT VERSION(); and include that information.
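To make the indexing suggestion concrete: a composite index covering the lookup columns could turn each per-row search from a full table scan into an index seek. The sketch below assumes UTC and Forecast are DATETIME columns and that Forecast values fall exactly on the hour; the index name is made up. Note that the original WHERE clause wraps these columns in YEAR()/MONTH()/DAY()/HOUR() calls, which generally prevents MySQL from using such an index, so the query is rewritten here to compare the columns directly:

# Sketch: one-time index creation, assuming UTC and Forecast are DATETIME.
cursor.execute("""
    CREATE INDEX idx_station_utc_forecast
        ON PM25_Predictions (stationid, UTC, Forecast)
""")

# Rewritten UPDATE with a sargable WHERE clause: no functions wrap the
# indexed columns, so MySQL can seek on the index above.
sql = '''
UPDATE PM25_Predictions
SET Lucas_ML_All = %s, Lucas_ML_One = %s
WHERE stationid = %s
  AND UTC >= %s AND UTC < %s
  AND Forecast = %s
'''

# Inside the same loop as before, the placeholders now take datetime
# objects rather than split-out year/month/day/hour parts.
from datetime import datetime, timedelta

day_start = datetime.strptime(date_org, "%Y%m%d")
day_end = day_start + timedelta(days=1)
forecast_ts = datetime(int(forecast_year), int(forecast_month),
                       int(forecast_day), int(forecast_hour))
params = (row.All_Station_Model_Pred, row.Individual_Station_Model_Pred,
          row.Station, day_start, day_end, forecast_ts)

Whether this helps depends on the actual schema and existing indexes, which is exactly why the commenters asked for the SHOW CREATE TABLE output.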