1

I have a python script that I am using using to make sql queries. The problem is that my VM only has 2GBs of RAM and some of the sql queries are too RAM intensive and therefore the kernel automatically kills the script. How can I make this code more RAM efficient? I would like to implement pagination in my postgres sql code. How would I do that? Does anyone know an easy implementation of that? I would greatly appreciate your help!

Updated Code

from __future__ import print_function

try:
    import psycopg2
except ImportError:
    raise ImportError('\n\033[33mpsycopg2 library missing. pip install psycopg2\033[1;m\n')
    sys.exit(1)


import re
import sys
import json
import pprint
import time

outfilepath = "crtsh_output/crtsh_flat_file"

DB_HOST = 'crt.sh'
DB_NAME = 'certwatch'
DB_USER = 'guest'

# DELAY = 0


def connect_to_db():
    start = 0
    offset = 10
    flag = True
    while flag:
        filepath = 'forager.txt'
        with open(filepath) as fp:
            unique_domains = ''
            try:
                conn = psycopg2.connect("dbname={0} user={1} host={2}".format(DB_NAME, DB_USER, DB_HOST))
                cursor = conn.cursor()
                cursor.itersize = 10000
                for cnt, domain_name in enumerate(fp):
                    print("Line {}: {}".format(cnt, domain_name))
                    print(domain_name)
                    domain_name = domain_name.rstrip()

                    cursor.execute('''SELECT c.id, x509_commonName(c.certificate), x509_issuerName(c.certificate), x509_notBefore(c.certificate), x509_notAfter(c.certificate), x509_issuerName(c.certificate), x509_keyAlgorithm(c.certificate), x509_keySize(c.certificate), x509_publicKeyMD5(c.certificate), x509_publicKey(c.certificate), x509_rsaModulus(c.certificate), x509_serialNumber(c.certificate), x509_signatureHashAlgorithm(c.certificate), x509_signatureKeyAlgorithm(c.certificate), x509_subjectName(c.certificate), x509_name(c.certificate), x509_name_print(c.certificate), x509_commonName(c.certificate), x509_subjectKeyIdentifier(c.certificate), x509_extKeyUsages(c.certificate), x509_certPolicies(c.certificate), x509_canIssueCerts(c.certificate), x509_getPathLenConstraint(c.certificate), x509_altNames(c.certificate), x509_altNames_raw(c.certificate), x509_cRLDistributionPoints(c.certificate), x509_authorityInfoAccess(c.certificate), x509_print(c.certificate), x509_anyNamesWithNULs(c.certificate), x509_extensions(c.certificate), x509_tbscert_strip_ct_ext(c.certificate), x509_hasROCAFingerprint(c.certificate)
                    FROM certificate c, certificate_identity ci WHERE
                    c.id= ci.certificate_id AND ci.name_type = 'dNSName' AND lower(ci.name_value) =
                    lower(%s) AND x509_notAfter(c.certificate) > statement_timestamp()''', (domain_name,))


                # query db with start and offset
                unique_domains = cursor.fetchall()
                if not unique_domains:
                    flag = False
                else:
                        # do processing with your data

                    pprint.pprint(unique_domains)

                    outfilepath = "crtsh2" + ".json"
                    with open(outfilepath, 'a') as outfile:
                            outfile.write(json.dumps(unique_domains, sort_keys=True, indent=4, default=str, ensure_ascii = False))
                    offset += limit


            except Exception as error:
                print(str(error))

if __name__ == "__main__":
    connect_to_db()
9
  • use something like cur.fetchmany(n). Returns the next 'n' rows from your query. Commented Aug 8, 2018 at 13:18
  • @Mokadillion Thank you for your response! In which section of my code should I implement cur.fetchmany(n)? Commented Aug 8, 2018 at 13:41
  • One odd thing is that the cursor.fetchall() is called outside the loop - which means the database never gets a chance to close resources consumed by prior runs of the query. You should process query results for each execute() call - append to a list, update a set, etc. Commented Aug 8, 2018 at 13:59
  • @bimsapi Thank you for your help! For cursor.fetchall() to be inside the loop, does it need to be indented once? Commented Aug 8, 2018 at 14:05
  • Yes - indent to the same level as the other statements in the loop. I would also avoid opening and closing crtsh2.json on each iteration. For simplicity, manage both files at the same time via with open(filepath) as fp, open('crtsh2.json') as outfile: Commented Aug 8, 2018 at 14:17

2 Answers 2

3

may be something like this:

limit = 10
offset = 0
flag = True
while flag:
    # query db with start and offset, example: select * from domains limit %start% offset %offset%
    unique_domains = cursor.fetchall()
    if not unique_domains:
        flag = False
    else:
        # do processing with your data
        offset += limit
Sign up to request clarification or add additional context in comments.

5 Comments

@Ajay_Gupta Thank you for your response. Are you saying I should put this code in place of where I have "unique_domains = cursor.fetchall()"?
@bedford this is a pseudo code, you have to query based on number of records you want to fetch defined in limit.
@Ajay_Gupta Thank you again for your response! What exactly do you mean by this being pseudo code?
@bedford it just mean that you have to insert your code into this logical code
@Ajay_Gupta I updated my code above. However, my code does not work. It just stalls on the first query and does nothing. Any help would be appreciated!
1

I found a link to paginate in Postgres. Five ways to paginate in Postgres, from the basic to the exotic

Here's an example: Keyset Pagination The techniques above can paginate any kind of query, including queries without order clauses. If we are willing to forgo this generality we reap optimizations. In particular when ordering by indexed column(s) the client can use values in the current page to choose which items to show in the next page. This is called keyset pagination.

For example let’s return to the medley example:

-- Add an index for keyset pagination (btrees support inequality)
CREATE INDEX n_idx ON medley USING btree (n);
SELECT * FROM medley ORDER BY n ASC LIMIT 5;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.