I'm using PostgreSQL 10 and need to look up the entries of a CSV file and compare them with the entries in my Postgres tables. The database looks like this; I have to insert the domain names into the domains table and the ranks into the ranks table:

CREATE TABLE lists (list_id integer PRIMARY KEY,
                    list_name text);

CREATE TABLE domains (domain_id BIGSERIAL PRIMARY KEY,
                      domain_name text UNIQUE);

CREATE TABLE ranks (list_id integer REFERENCES lists,
                    domain_id integer REFERENCES domains,
                    rank integer,
                    date date,
                    PRIMARY KEY (list_id, rank, date));

Each row of the CSV contains two fields, a rank and a domain name, like this: "1, google.com"

Currently I insert the domain names into the domains table, where the domain_id is auto-incremented and serves as the primary key. Then I want to insert the ranks into the ranks table, but I'm struggling to get the domain_id from the domains table into the ranks table, where it serves as a foreign key. So for each row of the CSV I want to look up the domain name in the domains table and get its domain_id as I insert the rank. Each domain name can have several ranks; they are distinguished by the date.

The script I'm currently using looks like this:

    import tkinter as tk
    from tkinter import filedialog
    import csv
    import psycopg2
    import shutil as sh

    # Ask the user for the CSV file without showing the main Tk window
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.askopenfilename()
    new_path = 'C:/Users/%user%/Desktop/alexa-top1m_16042018.csv'

    conn = psycopg2.connect("host=localhost dbname=test user=postgres password=test")
    cur = conn.cursor()

    # Work on a copy and prepend a header row so DictReader can name the columns
    sh.copy2(file_path, new_path)
    with open(new_path, 'r') as original:
        data = original.read()
    with open(new_path, 'w') as modified:
        modified.write("rank,domain_name\n" + data)

    with open(new_path, 'r', newline='') as f:
        reader = csv.DictReader(f)
        # The loop has to run inside the with block, while the file is still open
        for row in reader:
            cur.execute(
                """INSERT INTO ranks (list_id, rank, date) VALUES (%s, %s, %s);""",
                (1, row['rank'], '2018-04-16'),
            )

    conn.commit()

I'm using psycopg2 to connect to the database and run the queries.

Does anyone know how to do this, or have any other suggestions on how to achieve it?

1 Answer

You could create a temporary table that holds the CSV data and then use SQL queries to insert the data into the domains and ranks tables.

Here is the code for the temporary table:

CREATE TABLE temporary_table (
  rank INTEGER,
  domain TEXT
);

Fill this table with the CSV data.
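
One way to fill the table from the Python script is psycopg2's `copy_expert`, which streams a file object into a `COPY ... FROM STDIN` statement. The sketch below is only an illustration under a few assumptions: the helper names `normalize_csv` and `copy_into_staging` are made up here, the CSV has no header row, and the normalisation step exists only to strip the space after the comma in rows like "1, google.com".

```python
import csv
import io


def normalize_csv(src):
    """Return an in-memory CSV with two clean fields per row: rank,domain."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # skipinitialspace drops the blank after the comma in "1, google.com"
    for row in csv.reader(src, skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank, domain = row
        writer.writerow([int(rank), domain.strip()])
    out.seek(0)
    return out


def copy_into_staging(cur, src):
    """Bulk-load rank,domain rows into temporary_table via a psycopg2 cursor."""
    cur.copy_expert(
        "COPY temporary_table (rank, domain) FROM STDIN WITH (FORMAT csv)",
        normalize_csv(src),
    )


# Usage, assuming `cur` is an open psycopg2 cursor and `file_path` is the CSV:
#     with open(file_path, newline="") as f:
#         copy_into_staging(cur, f)
```

For a file with a million rows this should be considerably faster than one INSERT per row, since the data goes to the server in a single COPY.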

Now, insert the domains that are present in the CSV file but not present in the domains table.

INSERT INTO domains (domain_name)
  (SELECT DISTINCT domain as domain_name FROM temporary_table
    EXCEPT
  SELECT domain_name FROM domains);

Now that the domains table contains every domain from the CSV, we can insert the rows into the ranks table.

INSERT INTO ranks (list_id, domain_id, rank, date)
    SELECT 1 as list_id, d.domain_id, rank, now()::DATE 
    FROM temporary_table tt JOIN domains d ON tt.domain = d.domain_name;

In order to get the domain id for the rank we are inserting, we do a join between the temporary_table and the domains table by domain name. This way, we can find the domain_id for each rank.

Notice that I added 1 as list_id, and now()::date in the ranks insert because you didn't provide columns from which that data should be extracted.
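
If the list id and the date should come from the script instead of being hard-coded, the two INSERT statements can be run through psycopg2 with query parameters. This is a minimal sketch, assuming temporary_table has already been filled; the function name `load_ranks` and its parameters are hypothetical, and `conn` is expected to be an open psycopg2 connection.

```python
import datetime

INSERT_DOMAINS = """
    INSERT INTO domains (domain_name)
    SELECT DISTINCT domain FROM temporary_table
    EXCEPT
    SELECT domain_name FROM domains
"""

INSERT_RANKS = """
    INSERT INTO ranks (list_id, domain_id, rank, date)
    SELECT %s, d.domain_id, tt.rank, %s
    FROM temporary_table tt
    JOIN domains d ON tt.domain = d.domain_name
"""


def load_ranks(conn, list_id: int, rank_date: datetime.date) -> None:
    """Insert the new domains, then the ranks, in a single transaction."""
    cur = conn.cursor()
    try:
        cur.execute(INSERT_DOMAINS)
        # list_id and rank_date are passed as parameters, not formatted into the SQL
        cur.execute(INSERT_RANKS, (list_id, rank_date))
        conn.commit()
    except Exception:
        conn.rollback()  # leave the tables untouched if either INSERT fails
        raise
    finally:
        cur.close()
```

For the file in the question, the call would be `load_ranks(conn, 1, datetime.date(2018, 4, 16))`.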

Also, be careful with the composite primary key PRIMARY KEY (list_id, rank, date). If you insert ranks for multiple domains on the same date, and some of those domains have the same rank and list_id values, you will get a duplicate key value error and the data won't be inserted. To fix this, you can add domain_id to the composite primary key as well.

2 Comments

Thank you for the answer, it seemed to work. If I can bother you with another question: do you know how I can store the domain names as lowercase and Punycode?
You could make the names lowercase by using the lower function. For example, SELECT lower('TesT') returns "test" as a result. I am not sure if there is a function that converts to Punycode. Maybe you should ask another question about that so someone with more experience can give you a proper answer.
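
On the Python side, the standard library ships an `idna` codec that can produce Punycode before the names are sent to the database. A small sketch, with two caveats: the codec implements the older IDNA 2003 rules (RFC 3490), and the helper name `to_ascii_domain` is made up for this example.

```python
def to_ascii_domain(name: str) -> str:
    """Lowercase a domain name and encode non-ASCII labels as Punycode."""
    # Lowercase first: the idna codec returns pure-ASCII labels unchanged.
    return name.strip().lower().encode("idna").decode("ascii")


print(to_ascii_domain("Google.COM"))  # google.com
print(to_ascii_domain("Bücher.de"))   # xn--bcher-kva.de
```

The conversion could be applied to each domain while reading the CSV, so that the domains table only ever sees the normalised form.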
