I have an application that needs to load data from user-specified CSV files into PostgreSQL database tables.
The structure of the CSV file is simple:
name,email
John Doe,[email protected]
...
In the database I have three tables:
---------------
-- CAMPAIGNS --
---------------
CREATE TABLE "campaigns" (
"id" serial PRIMARY KEY,
"name" citext UNIQUE CHECK ("name" ~ '^[-a-z0-9_]+$'),
"title" text
);
----------------
-- RECIPIENTS --
----------------
CREATE TABLE "recipients" (
"id" serial PRIMARY KEY,
"email" citext UNIQUE CHECK (length("email") <= 254),
"name" text
);
-----------------
-- SUBMISSIONS --
-----------------
CREATE TYPE "enum_submissions_status" AS ENUM (
'WAITING',
'SENT',
'FAILED'
);
CREATE TABLE "submissions" (
"id" serial PRIMARY KEY,
"campaignId" integer REFERENCES "campaigns" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"recipientId" integer REFERENCES "recipients" ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
"status" "enum_submissions_status" DEFAULT 'WAITING',
"sentAt" timestamp with time zone
);
CREATE UNIQUE INDEX "submissions_unique" ON "submissions" ("campaignId", "recipientId");
CREATE INDEX "submissions_recipient_id_index" ON "submissions" ("recipientId");
I want to read all rows from the specified CSV file and make sure that corresponding records exist in the recipients and submissions tables.
What would be the most performance-efficient way to load the data into these tables?
This is primarily a conceptual question, I'm not asking for a concrete implementation.
First of all, I naively tried to read and parse the CSV file line by line and issue SELECT/INSERT queries for each e-mail. Obviously, that was a very slow solution, loading only ~4k records per minute, but the code was simple and straightforward.
Now I still read the CSV file line by line, but aggregate the e-mails into batches of 1,000 elements. All SELECT/INSERT queries are issued in batches using SELECT id, email WHERE email IN ('...', '...', '...', ...) constructs. This approach increased throughput to ~25k records per minute, but it required fairly complex, multi-step code to work.
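For reference, each batch currently boils down to queries of roughly this shape (a simplified sketch with three e-mails instead of 1,000; the real statements are generated by the application code):
-- Step 1: find out which e-mails from the batch already exist
SELECT "id", "email"
FROM "recipients"
WHERE "email" IN ('[email protected]', '[email protected]', '[email protected]');
-- Step 2: insert the e-mails that were not found in step 1
INSERT INTO "recipients" ("name", "email")
VALUES ('Bob', '[email protected]'), ('Carol', '[email protected]');
-- Step 3: a similar lookup/insert cycle for "submissions",
-- using the recipient ids collected in steps 1 and 2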
Are there any better approaches to solve this problem and get even greater performance?
The key problem here is that I need to insert data into the recipients table first and then use the generated id to create a corresponding record in the submissions table.
Also, I need to make sure that the inserted e-mails are unique. Right now, I'm using a simple array-based index in my application to prevent duplicate e-mails from being added to the batch.
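For illustration, the kind of single statement I imagine could replace this per-batch bookkeeping looks roughly like the sketch below (untested; the hard-coded campaign id 1 and the two sample rows are placeholders, and duplicate e-mails still have to be removed from the batch beforehand, because ON CONFLICT DO UPDATE cannot touch the same row twice in one statement):
-- Upsert the recipients and feed the returned ids straight into submissions
WITH "input_rows" ("name", "email") AS (
  VALUES ('John Doe', '[email protected]'),
         ('Jane Doe', '[email protected]')
), "upserted" AS (
  INSERT INTO "recipients" ("name", "email")
  SELECT "name", "email" FROM "input_rows"
  -- DO UPDATE rather than DO NOTHING, so that RETURNING also yields
  -- the ids of e-mails that already existed
  ON CONFLICT ("email") DO UPDATE SET "name" = EXCLUDED."name"
  RETURNING "id"
)
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 1, "id" FROM "upserted"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;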
I'm writing my app using Node.js and Sequelize with Knex; however, the concrete technology doesn't matter much here.
COPY is the fastest way to go. See: stackoverflow.com/questions/33271377/…
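For illustration, the COPY route could look roughly like the sketch below: bulk-load the raw CSV into a throwaway staging table, then move the data into the real tables with two set-based statements. This is a sketch, not a drop-in solution: the staging table name and the hard-coded campaign id 1 are placeholders, and from a Node.js client the CSV would be streamed into the COPY command over the connection (e.g. with pg-copy-streams) rather than read from a server-side path.
-- Staging table that mirrors the CSV layout
CREATE TEMP TABLE "csv_import" ("name" text, "email" text);
-- Bulk-load the file; the application streams the CSV data into STDIN
COPY "csv_import" ("name", "email") FROM STDIN WITH (FORMAT csv, HEADER true);
-- Insert e-mails that are not yet in recipients; duplicates inside the
-- file are absorbed by DO NOTHING as well
INSERT INTO "recipients" ("name", "email")
SELECT "name", "email" FROM "csv_import"
ON CONFLICT ("email") DO NOTHING;
-- Create the submissions for the given campaign, skipping existing pairs
INSERT INTO "submissions" ("campaignId", "recipientId")
SELECT 1, r."id"
FROM "csv_import" c
JOIN "recipients" r ON r."email" = c."email"
ON CONFLICT ("campaignId", "recipientId") DO NOTHING;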