I have a CSV file on a remote Unix server. I need to load its data into a Postgres-based database (Greenplum) that runs on a different remote server.

Currently, I pull the CSV onto my local drive with WinSCP and then load it into the remote Greenplum database with pgAdmin from that local copy.

Pulling the data onto a local machine only to push it into Greenplum seems circuitous, and it is taking a long time (>100 hrs).

I think there must be a way to bulk load the remote CSV into the remote Greenplum database without going through a local machine. Does anyone have experience with this kind of data migration? I am using Talend for the ETL.

Thanks!

1 Answer

Yes, there is a way to bulk load that data from the remote server into Greenplum, and it is significantly faster.

Your Talend server will need to be networked so that it can communicate with the segment hosts in your cluster. Here is a guide on how the network should be configured: http://gpdb.docs.pivotal.io/4380/admin_guide/intro/about_loading.html

You can then use "gpload" to load the data. This is a utility that automates the tasks of starting a gpfdist process, creating an external table and performing an INSERT statement for you. Documentation on gpload: http://gpdb.docs.pivotal.io/4380/utility_guide/admin_utilities/gpload.html#topic1
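
As a rough illustration of what gpload consumes, here is a minimal Python sketch that writes out a control file for a plain CSV insert and builds the `gpload -f` command line. The database, user, host, table, and path values are hypothetical placeholders, and the ports are assumptions; check the key names against the gpload documentation for your Greenplum version. gpload starts gpfdist on the machine where it runs, so this would typically run on the host that can read the CSV.

```python
def build_gpload_control(database, user, master_host, table, csv_path,
                         etl_host, gpfdist_port=8081, master_port=5432):
    """Return the text of a gpload control file for a simple CSV insert.

    All argument values are placeholders supplied by the caller; the
    default ports are assumptions, not values from the original post.
    """
    lines = [
        "VERSION: 1.0.0.1",
        f"DATABASE: {database}",
        f"USER: {user}",
        f"HOST: {master_host}",
        f"PORT: {master_port}",
        "GPLOAD:",
        "   INPUT:",
        "    - SOURCE:",
        "         LOCAL_HOSTNAME:",
        f"           - {etl_host}",      # host the segments connect back to
        f"         PORT: {gpfdist_port}",  # port for the gpfdist process
        "         FILE:",
        f"           - {csv_path}",
        "    - FORMAT: csv",
        "    - DELIMITER: ','",
        "    - HEADER: true",            # assumes the CSV has a header row
        "   OUTPUT:",
        f"    - TABLE: {table}",
        "    - MODE: insert",
    ]
    return "\n".join(lines) + "\n"


def gpload_command(control_path):
    """Return the argv list that runs gpload against a control file."""
    return ["gpload", "-f", control_path]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    text = build_gpload_control(
        database="analytics", user="gpadmin", master_host="mdw",
        table="public.sales", csv_path="/data/sales.csv", etl_host="etl1",
    )
    with open("load_sales.yml", "w") as fh:
        fh.write(text)
    print(" ".join(gpload_command("load_sales.yml")))
```

The command can then be run by hand or passed to `subprocess.run` on the ETL host, provided that host can reach the Greenplum master and the segment hosts can reach the gpfdist port.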

Lastly, Talend is a Pivotal partner, and they have lots of documentation on how to use their tools to load data into Greenplum. Talend leverages gpfdist to load data into the database in parallel, just like gpload.

2 Comments

Thanks Jon! That is very helpful to know. I have been able to do a basic insert with this feature. Could you give me some pointers on whether I can do an upsert (update if the record is present, insert if it is absent) with gpload? Does a merge do what I am looking for? The basic table load with a simple insert is very fast, but I am unsure how to approach the INSERT INTO x SELECT FROM y WHERE conditions pattern. I appreciate any guidance you can provide.
gpload has a merge option, but it won't handle duplicates, and I'm not a fan of merge because of that. Here is another way to handle a merge: pivotalguru.com/?p=104. Lastly, you mentioned putting conditions on the external table. You can certainly do that, and you can also transform the columns with the variety of SQL functions Greenplum has.
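
One common way to get upsert behavior without gpload's merge mode is to bulk load into a staging table and then run an UPDATE followed by an INSERT of the rows that are still missing. The sketch below only builds those two SQL statements; it is not necessarily the approach described in the linked post. The table and column names are hypothetical, identifiers are interpolated without quoting (so use trusted names only), and it assumes the staging table has already been deduplicated on the key columns.

```python
def build_upsert_sql(target, staging, key_cols, update_cols):
    """Build UPDATE-then-INSERT statements that upsert staging into target.

    Assumes staging holds at most one row per key; duplicates must be
    removed beforehand (for example with row_number() in a subquery).
    """
    join = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"{c} = s.{c}" for c in update_cols)
    all_cols = list(key_cols) + list(update_cols)
    col_list = ", ".join(all_cols)
    select_list = ", ".join(f"s.{c}" for c in all_cols)

    # Step 1: overwrite the rows that already exist in the target.
    update_sql = (
        f"UPDATE {target} t SET {set_clause} "
        f"FROM {staging} s WHERE {join}"
    )
    # Step 2: insert staging rows that have no match in the target.
    insert_sql = (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {select_list} FROM {staging} s "
        f"LEFT JOIN {target} t ON {join} "
        f"WHERE t.{key_cols[0]} IS NULL"
    )
    return [update_sql, insert_sql]


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    for stmt in build_upsert_sql("sales", "sales_stage", ["id"], ["amount"]):
        print(stmt + ";")
```

Running both statements in a single transaction keeps the target consistent if the second one fails.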
