
I have a CSV file which is quite large (a few hundred MB) that I am trying to import into a Postgres table. The problem arises when there is a primary key violation (a duplicate record in the CSV file).

If it were a one-off I could filter those records out manually, but these files are generated by a program that produces new data every hour. My script has to import them into the database automatically.

My question is: is there a flag I can set in the COPY command, or in Postgres itself, so that it skips the duplicate records and continues importing the rest of the file into the table?
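For reference, a minimal sketch of the kind of import I run every hour (the table and file names here are made up, and the real table has more columns):

    -- hypothetical target table; the real one is wider
    CREATE TABLE readings (
        id       bigint PRIMARY KEY,
        recorded timestamptz,
        value    numeric
    );

    -- the hourly load; this aborts with a "duplicate key value violates
    -- unique constraint" error as soon as the CSV repeats an id
    COPY readings (id, recorded, value)
    FROM '/data/readings.csv'
    WITH (FORMAT csv, HEADER true);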

1 Answer


My thought would be to approach this in two ways:

  1. Use a utility that can produce an "exception report" of duplicate rows during the COPY process, such as this one.
  2. Change your workflow: load the data into a temp table first, massage it for duplicates (for example, JOIN it with your target table and mark every row that already exists with a dup flag), then import only the missing records and send the dups to an exception table (see the sketch below).

I personally prefer the second approach, but which one fits better depends on the specifics of your workflow.
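A minimal sketch of the second approach, assuming a target table `target_table` with primary key `id`, an `exception_table` with the same columns, and a placeholder file path (all names here are illustrative, not from your setup):

    -- 1. Stage the raw CSV in an unconstrained temp table
    CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS);

    COPY staging FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);

    -- 2. Route rows whose key already exists in the target to the exception table
    INSERT INTO exception_table
    SELECT s.*
    FROM staging s
    JOIN target_table t ON t.id = s.id;

    -- 3. Insert only the missing rows, also collapsing duplicates within the file itself
    INSERT INTO target_table
    SELECT DISTINCT ON (s.id) s.*
    FROM staging s
    LEFT JOIN target_table t ON t.id = s.id
    WHERE t.id IS NULL
    ORDER BY s.id;

    DROP TABLE staging;

The JOIN in step 2 plays the role of the "dup flag" described above: anything that matches an existing key is diverted instead of loaded, so the COPY into the staging table never hits a constraint violation.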


4 Comments

I'd go for the second solution as well (or use a different tool to load the data)
I'd go for the second solution
I tried @dawebber's second approach, but since the database grows with every import, each subsequent import takes longer and longer.
@regexhacks, which part of the workflow is seeing a slowdown?
