
I have an application that needs to load data from user-specified CSV files into PostgreSQL database tables.

The structure of the CSV file is simple:

name,email
John Doe,[email protected]
...

In the database I have three tables:

---------------
-- CAMPAIGNS --
---------------

CREATE TABLE "campaigns" (
    "id"         serial  PRIMARY KEY,
    "name"       citext  UNIQUE CHECK ("name" ~ '^[-a-z0-9_]+$'),
    "title"      text
);

----------------
-- RECIPIENTS --
----------------

CREATE TABLE "recipients" (
    "id"           serial  PRIMARY KEY,
    "email"        citext  UNIQUE CHECK (length("email") <= 254),
    "name"         text
);


-----------------
-- SUBMISSIONS --
-----------------

CREATE TYPE "enum_submissions_status" AS ENUM (
    'WAITING',
    'SENT',
    'FAILED'
);

CREATE TABLE "submissions" (
    "id"           serial                     PRIMARY KEY,
    "campaignId"   integer                    REFERENCES "campaigns"   ON UPDATE CASCADE  ON DELETE CASCADE  NOT NULL,
    "recipientId"  integer                    REFERENCES "recipients"  ON UPDATE CASCADE  ON DELETE CASCADE  NOT NULL,
    "status"       "enum_submissions_status"  DEFAULT 'WAITING',
    "sentAt"       timestamp with time zone
);

CREATE UNIQUE INDEX "submissions_unique" ON "submissions" ("campaignId", "recipientId");
CREATE INDEX "submissions_recipient_id_index" ON "submissions" ("recipientId");

I want to read all rows from the specified CSV file and make sure that corresponding records exist in the recipients and submissions tables.

What would be the most performance-efficient method to load data into these tables?

This is primarily a conceptual question, I'm not asking for a concrete implementation.


  • First of all, I naively tried to read and parse the CSV file line by line and issue SELECT/INSERT queries for each e-mail. Obviously, that was a very slow solution that let me load ~4k records per minute, but the code was pretty simple and straightforward.

  • Now, I'm reading the CSV file line by line, but aggregating the e-mails into batches of 1,000 elements. All SELECT/INSERT queries are made in batches using SELECT "id", "email" FROM "recipients" WHERE "email" IN ('...', '...', '...', ...) constructs (see the sketch after this list). This approach increased the performance, and now I can load ~25k records per minute. However, it required pretty complex multi-step code to work.
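For illustration, here is a minimal sketch of that batched pattern against the schema above; the literal values are placeholders, and RETURNING is one way to get the generated ids back in the same round trip:

SELECT "id", "email"
FROM "recipients"
WHERE "email" IN ('...', '...', '...');  -- up to 1,000 e-mails per batch

INSERT INTO "recipients" ("name", "email")
VALUES ('...', '...'),
       ('...', '...')                    -- one row per missing e-mail
RETURNING "id", "email";

The multi-row VALUES insert collapses the per-row INSERT round trips into one statement per batch.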

Are there any better approaches to solve this problem and get even greater performance?


The key problem here is that I need to insert data into the recipients table first, and then I need to use the generated id to create a corresponding record in the submissions table.
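Since PostgreSQL 9.5, this two-step dance can be collapsed into a single statement with a data-modifying CTE and ON CONFLICT. A minimal sketch, assuming the e-mails in the VALUES list are already de-duplicated and 42 stands in for a real campaign id:

WITH "input" ("name", "email") AS (
    VALUES ('John Doe', '...')  -- one row per CSV line
), "upserted" AS (
    INSERT INTO "recipients" ("name", "email")
    SELECT "name", "email" FROM "input"
    -- The no-op DO UPDATE makes RETURNING yield ids for pre-existing rows too;
    -- a plain DO NOTHING would return only freshly inserted ones.
    ON CONFLICT ("email") DO UPDATE SET "email" = EXCLUDED."email"
    RETURNING "id"
)
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 42, "id" FROM "upserted"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;

The final ON CONFLICT ... DO NOTHING matches the "submissions_unique" index, so re-running the same file is idempotent.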

Also, I need to make sure that the inserted e-mails are unique. Right now, I'm using a simple array-based index in my application to prevent duplicate e-mails from being added to a batch.

I'm writing my app using Node.js and Sequelize with Knex; however, the concrete technology doesn't matter much here.

  • Load the data into a temporary table, then use any feature of SQL/PostgreSQL you need. Commented Mar 19, 2016 at 23:09
  • Are you familiar with the COPY (postgresql.org/docs/9.5/static/sql-copy.html) command? Bring it into a temporary table and then use your inserts to populate the destination tables. (COPY isn't standard SQL, btw.) Commented Mar 20, 2016 at 0:11
  • Using COPY is the fastest way to go. See: stackoverflow.com/questions/33271377/… Commented Mar 21, 2016 at 9:07
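For reference, a minimal sketch of the temp-table approach suggested in these comments; the file path and the campaign id 42 are placeholders, and from an application you would normally stream the file through the driver's COPY ... FROM STDIN support rather than a server-side path:

BEGIN;

CREATE TEMP TABLE "staging" ("name" text, "email" citext) ON COMMIT DROP;

-- COPY ingests the whole file in a single command.
COPY "staging" ("name", "email") FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER true);

-- DISTINCT ON collapses duplicate e-mails within the file itself.
INSERT INTO "recipients" ("name", "email")
SELECT DISTINCT ON ("email") "name", "email"
FROM "staging"
ON CONFLICT ("email") DO NOTHING;

INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 42, r."id"
FROM "staging" s
JOIN "recipients" r ON r."email" = s."email"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;

COMMIT;

The join back to recipients picks up both pre-existing and freshly inserted ids, so no RETURNING plumbing is needed, and the whole load stays a handful of statements regardless of file size.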

1 Answer


pgAdmin has had a GUI for data import since 1.16. You have to create your table first, and then you can import data easily: just right-click on the table name and click Import.


1 Comment

I can't use some third-party graphical tool for this. I need to do this through my own application or at least via some kind of API.
