
I have always deleted duplicates with this kind of query:

delete from test a
using test b
where a.ctid < b.ctid
  and a.col1 = b.col1
  and a.col2 = b.col2
  and a.col3 = b.col3;

Also, I have seen this query being used:

DELETE FROM test WHERE test.ctid NOT IN
(SELECT ctid FROM (
    SELECT DISTINCT ON (col1, col2) *, ctid
    FROM test) s);

And even this one (repeated until you run out of duplicates):

delete from test ju where ju.ctid in
(select ctid from (
    select distinct on (col1, col2) *, ctid from test ou
    where (select count(*) from test inr
           where inr.col1 = ou.col1 and inr.col2 = ou.col2) > 1) s);

Now I have run into a table with 5 million rows, which has indexes on the columns that are matched in the WHERE clause. And now I wonder:

Which of all these methods, which apparently do the same thing, is the most efficient, and why? I just ran the second one, and it took over 45 minutes to remove the duplicates. I'm curious which would be the most efficient one, in case I have to remove duplicates from another huge table. Whether the table has a primary key in the first place doesn't matter; you can always create one.
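One way to compare the methods yourself without touching the data is to wrap each DELETE in a transaction and roll it back, keeping only the timing (a sketch using the first query; EXPLAIN ANALYZE actually executes the statement):

```sql
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)         -- executes the DELETE and reports timing
DELETE FROM test a
USING test b
WHERE a.ctid < b.ctid
  AND a.col1 = b.col1
  AND a.col2 = b.col2
  AND a.col3 = b.col3;
ROLLBACK;                          -- discard the deletion, keep only the plan/timing
```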

  • Comparison of ctid values is quite slow, do you have another way of uniquely identifying a row? Maybe a serial (or identity) column? Commented Dec 11, 2018 at 10:32
  • I can always create a serial ID, that is no problem. I just killed the second example after 45 minutes of running time, while the first one finished after just 50 seconds. I'm pretty sure both of them remove duplicates, leaving just one single row. Why does that huge time difference happen, given that both do the same thing? Commented Dec 11, 2018 at 10:42
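Adding such a surrogate key is a one-liner (a sketch for PostgreSQL 10+; the column name id is hypothetical):

```sql
-- Add an auto-populated unique integer column to use instead of ctid
ALTER TABLE test ADD COLUMN id bigint GENERATED ALWAYS AS IDENTITY;
-- Optional: index it so lookups and joins on it can avoid a sequential scan
CREATE UNIQUE INDEX ON test (id);
```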

1 Answer


demo:db<>fiddle

Finding duplicates can easily be achieved with the row_number() window function:

SELECT ctid
FROM (
    SELECT
        *,
        ctid,
        row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
    FROM test
) s
WHERE row_number >= 2

This groups tied rows and adds a row counter within each group. So every row with row_number > 1 is a duplicate which can be deleted:

DELETE
FROM test
WHERE ctid IN
(
    SELECT ctid
    FROM (
        SELECT
            *,
            ctid,
            row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid)
        FROM test
    ) s
    WHERE row_number >= 2
)
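For illustration, here is what the numbering looks like on a tiny made-up sample (two partition columns instead of three, values invented for the example):

```sql
SELECT col1, col2,
       row_number() OVER (PARTITION BY col1, col2 ORDER BY col1) AS row_number
FROM (VALUES ('a', 1), ('a', 1), ('b', 2)) AS t(col1, col2);
-- The two ('a', 1) rows get row_number 1 and 2; the ('b', 2) row gets 1.
-- Filtering on row_number >= 2 keeps exactly one row per group.
```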

I don't know if this solution is faster than your attempts, but you could give it a try.

Furthermore - as @a_horse_with_no_name already stated - I would recommend using your own identifier (e.g. a serial or identity column) instead of ctid for performance reasons.
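With such a column (assuming it is called id), the same window-function pattern looks like this:

```sql
-- Same pattern as above, but keyed on a regular indexed column
-- instead of the system column ctid
DELETE FROM test
WHERE id IN
(
    SELECT id
    FROM (
        SELECT id,
               row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY id)
        FROM test
    ) s
    WHERE row_number >= 2
);
```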


Edit:

For my test data, your first version seems to be a little faster than my solution. Your second version seems to be slower, and your third version does not work for me (after fixing the syntax errors, it returns no result).

demo:db<>fiddle


10 Comments

I have tested your version and it works better than the first one I tried, since, for some reason, mine doesn't delete some duplicates. Also, thank you very much; I didn't know about PARTITION BY. I don't quite understand how your query deletes all duplicated rows yet leaves one row for each set of duplicates, but it works.
Please have a look at my fiddle. The 2nd section shows what row_number does: it adds a counter from 1 to 3 for every set of duplicate rows. So the first row, with row_number = 1, can be seen as the "original" one, and all the following ones are duplicates. The next step (3rd section) filters all rows with row_number >= 2 (i.e. NOT 1). So the first rows are not selected, but all the others are; these selected ones can be deleted, and the row_number = 1 rows stay.
Thanks for the explanation. I tested it on my table and it works.
Great! But "WHERE ctid IN" is too slow for my 70-million-row table; using the PRIMARY KEY is much faster than ctid.
Also, what about trying this: stackoverflow.com/a/66659351/8523960 ?