
I am trying to upload part of a text file into a database table. The text file is around 12 GB. I am parsing the text file line by line and inserting each row into the table.

The following is the code that I am using to upload the data:

import psycopg2 as pg
import os
import datetime

sub_column_list = ['', 'SUB', 'GIS', 'MO', 'DA', 'YR', 'AREAkm2', 'PRECIPmm', 'SNOMELTmm', 'PETmm', 'ETmm', 'SWmm', 'PERCmm',
          'SURQmm', 'GW_Qmm', 'WYLDmm', 'SYLDt/ha', 'ORGNkg/ha', 'ORGPkg/ha', 'NSURQkg/ha', 'SOLPkg/ha',
          'SEDPkg/ha', 'LATQmm', 'LATNO3kg/ha', 'GWNO3kg/ha', 'CHOLAmic/L', 'CBODUmg/L', 'DOXQmg/L', 'TNO3kg/ha']

sub_vars = ['PRECIPmm', 'PETmm', 'ETmm', 'SWmm', 'SURQmm']

conn = pg.connect('dbname=swat_db user=admin password=pass host=localhost port=5435')

cur = conn.cursor()

watershed_id = 1
if file.endswith('.sub'):  # 'file' and 'output_path' are defined earlier in the script

    sub_path = os.path.join(output_path, file)
    f = open(sub_path)
    # advance past the header; the column-header row contains 'AREAkm2'
    for skip_line in f:
        if 'AREAkm2' in skip_line:
            break

    for line in f:
        columns = line.split()
        for item in sub_vars:
            sub = int(columns[1])
            dt = datetime.date(int(columns[5]), int(columns[3]), int(columns[4]))
            val = float(columns[sub_column_list.index(item)])
            # one INSERT statement per variable per data line
            cur.execute("""INSERT INTO output_sub (watershed_id, month_day_year, sub_id, var_name, val)
                         VALUES ({0}, '{1}', {2}, '{3}', {4})""".format(watershed_id, dt, sub, item, val))

        conn.commit()
    conn.close()

The sub_column_list is the list of all the columns in the text file, and sub_vars is the list of the variables that I would like to put into the database. This approach is taking a very long time. What would be a good way to improve the speed at which the values are inserted?

1 Answer

The first thing I notice is that you loop over the file in two stages: once to find the AREAkm2 header line, and then again from that point on to insert into your database. Maybe this is what you wanted, but you can also merge the two loops into a single pass:

if file.endswith('.sub'):

    sub_path = os.path.join(output_path, file)
    f = open(sub_path)
    header_seen = False
    for line in f:
        # skip everything up to and including the column-header line,
        # then parse the data rows in the same pass
        if not header_seen:
            if 'AREAkm2' in line:
                header_seen = True
            continue
        columns = line.split()
        sub = int(columns[1])
        dt = datetime.date(int(columns[5]), int(columns[3]), int(columns[4]))
        for item in sub_vars:
            val = float(columns[sub_column_list.index(item)])
            cur.execute("""INSERT INTO output_sub (watershed_id, month_day_year, sub_id, var_name, val)
                         VALUES ({0}, '{1}', {2}, '{3}', {4})""".format(watershed_id, dt, sub, item, val))
    conn.commit()
    conn.close()
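
Beyond the single pass, the dominant cost is likely one INSERT round trip per value plus a commit per input line. A common psycopg2 approach is to batch rows with psycopg2.extras.execute_values and a parameterized statement (which also avoids building SQL with str.format). A minimal sketch, assuming the same conn, cur, f (already positioned past the header line), watershed_id, sub_vars, and sub_column_list as above; the 10,000-row batch size is an arbitrary starting point:

from psycopg2.extras import execute_values

INSERT_SQL = """INSERT INTO output_sub (watershed_id, month_day_year, sub_id, var_name, val)
                VALUES %s"""

batch = []
for line in f:
    columns = line.split()
    sub = int(columns[1])
    dt = datetime.date(int(columns[5]), int(columns[3]), int(columns[4]))
    for item in sub_vars:
        val = float(columns[sub_column_list.index(item)])
        batch.append((watershed_id, dt, sub, item, val))
    if len(batch) >= 10000:
        # send 10,000 rows in one statement instead of 10,000 statements
        execute_values(cur, INSERT_SQL, batch)
        batch = []
if batch:
    execute_values(cur, INSERT_SQL, batch)  # flush the remainder
conn.commit()  # one commit for the whole file

For a 12 GB input, PostgreSQL's COPY protocol is usually faster still. A sketch under the same assumptions, using cur.copy_expert with an in-memory buffer; for a file this size you would flush the buffer to the database in chunks rather than accumulate everything:

import io

buf = io.StringIO()
for line in f:
    columns = line.split()
    sub = int(columns[1])
    dt = datetime.date(int(columns[5]), int(columns[3]), int(columns[4]))
    for item in sub_vars:
        val = float(columns[sub_column_list.index(item)])
        # tab-separated row in COPY's default text format
        buf.write('{0}\t{1}\t{2}\t{3}\t{4}\n'.format(watershed_id, dt, sub, item, val))
buf.seek(0)
cur.copy_expert(
    "COPY output_sub (watershed_id, month_day_year, sub_id, var_name, val) FROM STDIN",
    buf)
conn.commit()

Dropping or deferring indexes on output_sub during the load, and recreating them afterwards, can also help at this scale.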