I have an application that needs to load data from user-specified CSV files into PostgreSQL database tables.
The structure of the CSV file is simple:
name,email
John Doe,[email protected]
...
In the database I have three tables:
---------------
-- CAMPAIGNS --
---------------
CREATE TABLE "campaigns" (
"id" serial PRIMARY KEY,
"name" citext UNIQUE CHECK ("name" ~ '^[-a-z0-9_]+$'),
"title" text
);
----------------
-- RECIPIENTS --
----------------
CREATE TABLE "recipients" (
"id" serial PRIMARY KEY,
"email" citext UNIQUE CHECK (length("email") <= 254),
"name" text
);
-----------------
-- SUBMISSIONS --
-----------------
CREATE TYPE "enum_submissions_status" AS ENUM (
'WAITING',
'SENT',
'FAILED'
);
CREATE TABLE "submissions" (
"id" serial PRIMARY KEY,
"campaignId" integer REFERENCES "campaigns" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"recipientId" integer REFERENCES "recipients" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"status" "enum_submissions_status" DEFAULT 'WAITING',
"sentAt" timestamp with time zone
);
CREATE UNIQUE INDEX "submissions_unique" ON "submissions" ("campaignId", "recipientId");
CREATE INDEX "submissions_recipient_id_index" ON "submissions" ("recipientId");
I want to read all rows from the specified CSV file and make sure that corresponding records exist in the recipients and submissions tables.
What would be the most performance-efficient way to load the data into these tables?
This is primarily a conceptual question, I'm not asking for a concrete implementation.
First of all, I naively tried to read and parse the CSV file line by line and issue SELECT/INSERT queries for each e-mail. Obviously, that was a very slow solution, loading only ~4k records per minute, but the code was simple and straightforward.
Now I still read the CSV file line by line, but aggregate the e-mails into batches of 1,000 elements. All SELECT/INSERT queries are issued in batches using SELECT id, email WHERE email IN ('...', '...', '...', ...) constructs. This approach increased throughput to ~25k records per minute, but it required fairly complex, multi-step code to work.
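For reference, each batch currently boils down to queries of roughly this shape (a simplified sketch with three e-mails instead of 1,000; the real statements are generated by the application code):
-- Step 1: find out which e-mails from the batch already exist
SELECT "id", "email"
FROM "recipients"
WHERE "email" IN ('[email protected]', '[email protected]', '[email protected]');
-- Step 2: insert the e-mails that were not found in step 1
INSERT INTO "recipients" ("name", "email")
VALUES ('Bob', '[email protected]'), ('Carol', '[email protected]');
-- Step 3: a similar lookup/insert cycle for "submissions",
-- using the recipient ids collected in steps 1 and 2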
Are there any better approaches to solve this problem and get even greater performance?
The key problem here is that I need to insert data into the recipients table first and then use the generated id to create a corresponding record in the submissions table.
Also, I need to make sure that the inserted e-mails are unique. Right now, I'm using a simple array-based index in my application to prevent duplicate e-mails from being added to the batch.
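For illustration, the kind of single statement I imagine could replace this per-batch bookkeeping looks roughly like the sketch below (untested; the hard-coded campaign id 1 and the two sample rows are placeholders, and duplicate e-mails still have to be removed from the batch beforehand, because ON CONFLICT DO UPDATE cannot touch the same row twice in one statement):
-- Upsert the recipients and feed the returned ids straight into submissions
WITH "input_rows" ("name", "email") AS (
  VALUES ('John Doe', '[email protected]'),
         ('Jane Doe', '[email protected]')
), "upserted" AS (
  INSERT INTO "recipients" ("name", "email")
  SELECT "name", "email" FROM "input_rows"
  -- DO UPDATE rather than DO NOTHING, so that RETURNING also yields
  -- the ids of e-mails that already existed
  ON CONFLICT ("email") DO UPDATE SET "name" = EXCLUDED."name"
  RETURNING "id"
)
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 1, "id" FROM "upserted"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;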
I'm writing my app using Node.js and Sequelize with Knex; however, the concrete technology doesn't matter much here.
COPY is the fastest way to go. See: stackoverflow.com/questions/33271377/…
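For illustration, the COPY route could look roughly like the sketch below: bulk-load the raw CSV into a throwaway staging table, then move the data into the real tables with two set-based statements. This is a sketch, not a drop-in solution: the staging table name and the hard-coded campaign id 1 are placeholders, and from a Node.js client the CSV would be streamed into the COPY command over the connection (e.g. with pg-copy-streams) rather than read from a server-side path.
-- Staging table that mirrors the CSV layout
CREATE TEMP TABLE "csv_import" ("name" text, "email" text);
-- Bulk-load the file; the application streams the CSV data into STDIN
COPY "csv_import" ("name", "email") FROM STDIN WITH (FORMAT csv, HEADER true);
-- Insert e-mails that are not yet in recipients; duplicates inside the
-- file are absorbed by DO NOTHING as well
INSERT INTO "recipients" ("name", "email")
SELECT "name", "email" FROM "csv_import"
ON CONFLICT ("email") DO NOTHING;
-- Create the submissions for the given campaign, skipping existing pairs
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 1, r."id"
FROM "csv_import" c
JOIN "recipients" r ON r."email" = c."email"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;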