
I have a CSV file that I'm trying to import into my PostgreSQL database (v10). I'm using the following basic COPY syntax:

COPY table (col_1, col_2, col_3)
FROM '/filename.csv'
DELIMITER ',' CSV HEADER
QUOTE '"'
ESCAPE '\';

The first 30,000 lines or so import without any problem, but then I start running into formatting issues in the CSV file that break the import:

  • Double quotes in double quotes: "value_1",""value_2"","value_3" or "value_1","val"ue_2","value_3"

The typical error I get is

ERROR: extra data after last expected column

So I started editing the CSV file manually in Vim (the file has close to 7 million lines, so I can't really think of another desktop tool that could handle it).

  • Is there anything I can do with my SQL syntax to handle those malformed strings? Using alternative ESCAPE clauses? Using regex?
  • Can you think of a way to handle those formatting issues in Vim or using another tool or function?

Thanks a lot!

1 Answer


Note that the file does not meet the CSV specification:

  1. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

You should specify a quote character other than the double quote, for example '|':

create table test(a text, b text, c text);

copy test from '/data/example.csv' (format csv, quote '|');

select * from test;

     a     |      b      |     c     
-----------+-------------+-----------
 "value_1" | ""value_2"" | "value_3"
 "value_1" | "val"ue_2"  | "value_3"
(2 rows)

You can get rid of the unwanted double-quotes using the trim() or replace() functions, e.g.:

update test
set a = trim(a, '"'), b = trim(b, '"'), c = trim(c, '"');

select * from test;

    a    |    b     |    c    
---------+----------+---------
 value_1 | value_2  | value_3
 value_1 | val"ue_2 | value_3
(2 rows)
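If some of the double quotes inside the values are legitimate, trim() is too aggressive, because it strips every leading and trailing quote. A more conservative variant (just a sketch) uses regexp_replace() to remove only a single enclosing pair of quotes and leaves interior quotes alone:

```sql
-- sketch: strip only one enclosing pair of double quotes per value,
-- keeping any interior quotes that are part of the data
update test
set a = regexp_replace(a, '^"(.*)"$', '\1'),
    b = regexp_replace(b, '^"(.*)"$', '\1'),
    c = regexp_replace(c, '^"(.*)"$', '\1');
```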

7 Comments

I completely agree that the file doesn't meet the CSV specification (it's compliant for probably 99.99% of lines, but the remaining 500 or so cases I fear I'll have to fix manually). Unfortunately, I only have this CSV, not the database behind it, so I cannot generate another export using a different delimiter (pipe-delimited would be ideal). To complicate things, some of the double quotes in the fields are perfectly legitimate and would need escape characters before them: "2' 10", 2' 50"" for GPS coordinates in one case, "Mark "the beast" Hogan" for a nickname, or strings in Hebrew that are too tough to edit.
You define the quote character in a COPY command, and you asked about SQL syntax, so I believe you can do it this way. You don't have to own the table to do that. The second query is only an example; you can easily write a query that removes only the first and last characters of a string if they are double quotes.
If I change the quote character to a pipe at import, Postgres doesn't import anything: ERROR: invalid input syntax for integer: "" 1 "" on my primary key column. There are no pipes at all in my original CSV. I'm not sure this really solves the problem; I'd like to tell Postgres to ignore double quotes within double quotes.
In fact, the command won't work well when the anomalies occur in columns of a type other than text. Maybe you can create a temporary table to buffer the data?
You could import the data to a temporary table with text columns and then insert the data from the temp table into the destination one with necessary corrections in a single query.
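The staging approach described above could look like this (a sketch; the staging and target table names and columns are hypothetical): load every column into an all-text temporary table, clean the values there, and only then cast and insert into the real table.

```sql
-- hypothetical names: "staging" buffers the raw rows as text,
-- "target" is the real table with typed columns
create temporary table staging(id text, a text, b text, c text);

-- '|' as quote character effectively disables quote handling,
-- assuming the file contains no pipes
copy staging from '/data/example.csv' (format csv, quote '|', header);

-- strip stray quotes/spaces, then cast into the typed destination
insert into target(id, a, b, c)
select trim(id, ' "')::integer, trim(a, '"'), trim(b, '"'), trim(c, '"')
from staging;
```

This keeps the type errors out of the COPY step: any cast failure happens in the INSERT, where the offending rows can be inspected in the staging table first.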
