I have always deleted duplicates with this kind of query:
delete from test a
using test b
where a.ctid < b.ctid
and a.col1=b.col1
and a.col2=b.col2
and a.col3=b.col3
Also, I have seen this query being used:
DELETE FROM test WHERE test.ctid NOT IN
(SELECT ctid FROM (
SELECT DISTINCT ON (col1, col2) *
FROM test));
And even this one (repeated until you run out of duplicates):
delete from test ju where ju.ctid in
(select ctid from (
select distinct on (col1, col2) * from test ou
where (select count(*) from test inr
where inr.col1= ou.col1 and inr.col2=ou.col2) > 1
Now I have run into a table with 5 million rows, which have indexes in the columns that are going to match in the where clause. And now I wonder:
Which, of all those methods that apparently do the same, is the most efficient and why? I just run the second one and it is taking it over 45 minutes to remove duplicates. I'm just curious about which would be the most efficient one, in case I have to remove duplicates from another huge table. It wouldn't matter if it has a primary key in the first place, you can always create it or not.
ctidvalues is quite slow, do you have another way of uniquely identifying a row? Maybe aserial(oridentity) column?