
I store my data in a PostgreSQL server. I want to load a table with 15 million rows into a data.frame or data.table.

I use RPostgreSQL to load data.

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, ...)

# Select data from a table
system.time(
df <- dbGetQuery(con, "SELECT * FROM 15mil_rows_table")
)

It took 20 minutes to load the data from the DB into the data frame. I use a Google Cloud server with 60 GB of RAM and a 16-core CPU.

What should I do to reduce load time?

  • Do you need to load all the data into R? Might there be operations you could do directly in PostgreSQL - filter or aggregate rows, for instance? Commented Mar 29, 2015 at 11:37
  • @DominicComtois: I need to load all the data into R because I want to do many aggregations that are not easy to do in PostgreSQL. I have completed my R code; now I want to improve the data-loading part. Commented Mar 29, 2015 at 11:39
  • If you use src_postgres from dplyr, you can then use dplyr functions for the aggregation, and it will push many if not all of those operations back onto the database itself, so you won't need to read all the records into R (see the sketch after this list). ref: cran.rstudio.com/web/packages/dplyr/vignettes/databases.html Commented Mar 29, 2015 at 12:07
  • @hrbrmstr: Thank you for your advice. I will try dplyr later. I still want to find an answer to my question because I want to load all the data for aggregation and plotting. Commented Mar 30, 2015 at 2:36
  • you should be able to use dplyr to load the data into R. failing that, you should be able to write your data to csv and use fread in data.table to read it very quickly. limiting step is probably not RAM or cores -- it will be the disk read. Commented Apr 9, 2015 at 5:37
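For reference, a minimal sketch of the dplyr approach suggested in the comments above. The connection details and the column names some_group and some_value are placeholders, and src_postgres() was the dplyr API at the time of this question; newer dplyr/dbplyr versions instead use tbl() on a DBI connection.

library(dplyr)

# Placeholder connection details -- replace with your own.
src <- src_postgres(dbname = "database", host = "localhost", user = "user")
tbl_15mil <- tbl(src, "15mil_rows_table")

# dplyr translates these verbs into SQL that runs inside PostgreSQL,
# so only the (much smaller) aggregated result is pulled into R by collect().
# some_group and some_value are hypothetical column names.
agg <- tbl_15mil %>%
  group_by(some_group) %>%
  summarise(n = n(), total = sum(some_value)) %>%
  collect()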

2 Answers


Not sure by how much this will reduce the load time, but it should help, as both steps are quite efficient. You can leave a comment about the timing.

  1. Using bash, run psql to dump the table to CSV:

COPY 15mil_rows_table TO '/path/15mil_rows_table.csv' DELIMITER ',' CSV HEADER;
  2. In R, just fread it:

library(data.table)
DT <- fread("/path/15mil_rows_table.csv")
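If you prefer to drive both steps from a single R session, a rough sketch could look like the following. The path, database name and credentials are placeholders; note that COPY ... TO '<file>' writes the file on the database server side, so psql/Postgres must be able to write to that location.

library(data.table)

# Hypothetical path and connection details -- adjust to your setup.
csv_path <- "/path/15mil_rows_table.csv"
copy_cmd <- sprintf(
  "psql -h localhost -U user -d database -c \"COPY 15mil_rows_table TO '%s' DELIMITER ',' CSV HEADER\"",
  csv_path
)

system(copy_cmd)       # step 1: dump the table to CSV
DT <- fread(csv_path)  # step 2: read the CSV into a data.table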

1 Comment

I use this method to reduce load time as well, but I gzip the data to save storage. I added my answer below.

I use the same method as @Jan Gorecki, but with gzipped data to save storage.

1- Dump table to csv

psql -h localhost -U user -d 'database' -c "COPY 15mil_rows_table TO stdout DELIMITER ',' CSV HEADER" | gzip > 15mil_rows_table.csv.gz &

2- Load data in R

DT <- fread('zcat 15mil_rows_table.csv.gz')
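As a side note, fread('zcat file.csv.gz') relies on fread detecting a shell command and on zcat being available. On newer data.table versions (which postdate this question) the same thing can be written more explicitly with the cmd argument, or the .gz file can be read directly if the R.utils package is installed -- a small sketch, assuming those versions:

library(data.table)

# Explicit shell-command form on newer data.table versions:
DT <- fread(cmd = "zcat 15mil_rows_table.csv.gz")

# Direct read of the gzipped file (requires the R.utils package):
DT <- fread("15mil_rows_table.csv.gz")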

2 Comments

Can you share the total timing using this approach? It might be further improved by reading from stdout instead of a CSV file, but I'm not sure.
@JanGorecki: My table has 40mil rows and 11 columns. The load time is about 5 minutes.
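For what it's worth, the stdout idea from the comment above could be sketched by piping psql straight into fread, skipping the intermediate file entirely. The connection details are placeholders, and I haven't timed this against the gzip route.

library(data.table)

# Stream COPY ... TO stdout directly into fread via a shell pipe;
# host, user and database name are placeholders.
DT <- fread(cmd = "psql -h localhost -U user -d database -c \"COPY 15mil_rows_table TO stdout DELIMITER ',' CSV HEADER\"")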
