
I have 2 tables in my database. They both have more than 16M records and are related by a uuid column (both uuid fields are indexed). One of them is about 166GB and the other is around 50GB. I'll change the table names in my question, but I hope the question is still clear.

Let's say my first table is called users and the second one is profiles. I have a field in my users table and I want to copy it to my profiles table.

I started something last night, but it's still running and it has been more than 10 hours already.

I have 3 questions now. First question: are my queries OK?

ALTER TABLE profiles ADD COLUMN start_stamp TIMESTAMP DEFAULT NOW();

UPDATE profiles
SET start_stamp = (SELECT start_stamp::DATE FROM users WHERE uuid = profiles.uuid);

CREATE INDEX start_stamp ON profiles (start_stamp);

And the second question: is there any difference between these two queries? If yes, what's the difference and which one is better?

UPDATE profiles 
SET start_stamp = (SELECT start_stamp::DATE FROM users WHERE uuid = profiles.uuid);

QUERY PLAN
--------------------------------------------------------------------------
Update on profiles  (cost=0.00..159956638.61 rows=18491638 width=116)
->  Seq Scan on profiles  (cost=0.00..159956638.61 rows=18491638 width=116)
     SubPlan 1
       ->  Index Scan using unique_user_uuid on users  (cost=0.56..8.58 rows=1 width=20)
             Index Cond: ((uuid)::text = (profiles.uuid)::text)




UPDATE profiles
SET start_stamp = users.start_stamp
FROM users
WHERE profiles.start_stamp = users.start_stamp;

QUERY PLAN
--------------------------------------------------------------------------
Update on profiles  (cost=2766854.25..5282948.42 rows=11913522 width=142)
->  Hash Join  (cost=2766854.25..5282948.42 rows=11913522 width=142)
     Hash Cond: ((profiles.uuid)::text = (users.uuid)::text)
     ->  Seq Scan on profiles  (cost=0.00..1205927.56 rows=18491656 width=116)
     ->  Hash  (cost=2489957.22..2489957.22 rows=11913522 width=63)
           ->  Seq Scan on users  (cost=0.00..2489957.22 rows=11913522 width=63)

And my final question: is there a better way to copy a value from one table to another when the tables have more than 16M rows and around 200GB of data?

Thanks.

  • You've really asked three questions here. For the two updates, both are logically identical, but I'm not sure if the first version would even run on Postgres. The second version is the standard update join syntax. In terms of performance, you may check the execution plans of both updates, assuming both run. Commented Oct 16, 2018 at 6:26
  • I assume the mismatching WHERE condition in the latter one is a mistype and not actually meant that way Commented Oct 16, 2018 at 6:26
  • @a_horse_with_no_name Bad choice of words. I should have said something like "the typical way to do an update join in Postgres syntax." Does that work better? Yes, second one completely ANSI SQL. Commented Oct 16, 2018 at 6:31
  • The first query will only work if users.uuid is unique Commented Oct 16, 2018 at 6:31
  • Yes, the uuid field is unique for each record, and every profile and user share the same uuid across the two tables. But the point is that the query has been running since last night and I don't know how much longer it will take. That was the main reason to ask here. Can I do it faster, or should I wait for it? Commented Oct 16, 2018 at 6:33

2 Answers

1

The fastest way to update/copy a huge amount of data is CTAS (CREATE TABLE AS SELECT). It is only possible if you have the rights to do so and you can rename or drop the original table.

In your case it would be like this:

create table tmp_profiles as
select p.*, u.start_stamp::date
  from profiles p
  left join users u on p.uuid = u.uuid;

drop table profiles;

alter table tmp_profiles rename to profiles;

After that you have to recreate your keys, indexes, and other constraints.
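For example, a rough sketch of that rebuild step could look like the following (the primary key on uuid and the index name here are assumptions; adjust them to whatever constraints and indexes your real profiles table had):

-- Hypothetical rebuild after the swap; use your real constraint/index names.
ALTER TABLE profiles ADD PRIMARY KEY (uuid);

-- Recreate the index on the copied column, as in the question.
CREATE INDEX profiles_start_stamp_idx ON profiles (start_stamp);

-- Also re-add any foreign keys, defaults, NOT NULL constraints and triggers
-- that existed on the original table, then refresh planner statistics.
ANALYZE profiles;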

If you update more than roughly 5% of the records in your table, then CTAS will be at least a few times faster than a regular update. Below that threshold, an update can be faster than CTAS.


3 Comments

I've tried to do that; it's been 4 hours and it's still processing :/
This is the fastest way. I doubt you will find anything faster.
Do you have an HDD or an SSD? If an HDD, it will take a long time to rewrite over 166 GB.
0

Both of your queries are the same. It will take forever to update. This is a well-known problem when adding a NOT NULL column to a big table.

Sol 1: Update the values in chunks, running multiple queries to backfill the date (see the sketch below).

Sol 2: Recreate the entire table.
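As an illustration of Sol 1, a chunked backfill could look roughly like the sketch below. It assumes the new column was added without a default, so the rows still to be processed are NULL; the 10,000-row batch size and the NULL-based batching key are assumptions to tune for your data. Repeat the statement (committing between runs, e.g. from a script) until it updates 0 rows.

-- Hypothetical chunked backfill: run repeatedly until it reports UPDATE 0.
UPDATE profiles p
SET    start_stamp = u.start_stamp::date
FROM   users u
WHERE  p.uuid = u.uuid
  AND  p.uuid IN (
         SELECT uuid
         FROM   profiles
         WHERE  start_stamp IS NULL
         LIMIT  10000
       );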

Useful links for working with a large number of rows in Postgres: https://medium.com/doctolib-engineering/adding-a-not-null-constraint-on-pg-faster-with-minimal-locking-38b2c00c4d1c

https://dba.stackexchange.com/questions/52517/best-way-to-populate-a-new-column-in-a-large-table/52531#52531

https://dba.stackexchange.com/questions/41059/optimizing-bulk-update-performance-in-postgresql

2 Comments

I've tried to recreate the entire table with CREATE TABLE AS SELECT; it's been 4 hours and it's still processing :/
@htunc Have you locked the table before inserting? Other operations can block this. Please follow this: dba.stackexchange.com/questions/52517/…
