
I am running into performance issues when inserting a million rows into a PostgreSQL database.

I am sending a JSON object that contains an array with a million rows.

For each row I create a record in the database table. I have also tried inserting multiple rows at once, but the problem remains.

I am not sure how to handle this; I have read that the COPY command is fastest.

How can I improve the performance?

My JSON object, with log as an array (the log array has a million rows):

{"type":"monitoring","log":[
["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}

My current code (I am building a dynamic statement so that I can execute multiple rows at once):

IF(NOT b_first_line) THEN
          s_insert_query_values = right(s_insert_query_values, -1); -- remove the leading comma

          EXECUTE format('INSERT INTO log_rlda
                  (record_node_id, log_line, log_value, timestamp, record_log_id)
          VALUES %s;', s_insert_query_values);

          s_insert_query_values = '';
          i_num_lines_buffered = 0;
        END IF;
      END IF;

Each value inside the array within "log" needs to be inserted into its own row (in column log_value). This is what the resulting INSERT looks like, with s_insert_query_values expanded:

INSERT INTO log_rlda
                  (record_node_id, log_line, log_value, timestamp, record_log_id)
          VALUES  
     (806, 1, 0.00, '2016-10-12 20:33:21', 386),
     (807, 1, 0.00, '2016-10-12 20:33:21', 386),
     (808, 1, 0.00, '2016-10-12 20:33:21', 386),
     (809, 1, 0.00, '2016-10-12 20:33:21', 386),
     (810, 1, 0.0, '2016-10-12 20:33:21', 386),
     (811, 1, 24.00, '2016-10-12 20:33:21', 386),
     (768, 1, 1.83, '2016-10-12 20:33:21', 386),
     (769, 1, 0.00, '2016-10-12 20:33:21', 386),
     (728, 1, 1, '2016-10-12 20:33:21', 386),
     (771, 1, 1, '2016-10-12 20:33:21', 386),
     (729, 1, -100.00, '2016-10-12 20:33:21', 386),
     (806, 2, 0.00, '2016-10-12 20:33:23', 386),
     (807, 2, 0.00, '2016-10-12 20:33:23', 386),
     (808, 2, 0.00, '2016-10-12 20:33:23', 386),
     (809, 2, 0.00, '2016-10-12 20:33:23', 386),
     (810, 2, 0.0, '2016-10-12 20:33:23', 386),
     (811, 2, 24.00, '2016-10-12 20:33:23', 386),
     (768, 2, 1.52, '2016-10-12 20:33:23', 386),
     (769, 2, -0.61, '2016-10-12 20:33:23', 386),
     (728, 2, 1, '2016-10-12 20:33:23', 386),
     (771, 2, 1, '2016-10-12 20:33:23', 386),
     (729, 2, -100.00, '2016-10-12 20:33:23', 386)
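
For comparison, the COPY command mentioned above would stream the same rows as CSV instead of building one huge VALUES string; a minimal psql-style sketch (the CSV rows would have to be generated on the client, outside the PL/pgSQL function):

-- Sketch only: COPY reads the rows from the client as a CSV stream,
-- so the million rows would be produced client-side, not in PL/pgSQL.
COPY log_rlda (record_node_id, log_line, log_value, "timestamp", record_log_id)
FROM STDIN WITH (FORMAT csv);
806,1,0.00,2016-10-12 20:33:21,386
807,1,0.00,2016-10-12 20:33:21,386
\.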

Solution (i_node_id_list contains IDs I selected before this query):

SELECT i_node_id_list[log_value_index] AS record_node_id,
       e.log_line - 1 AS log_line,
       items.log_value::double precision AS log_value,
       to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS') AS "timestamp",
       i_log_id AS record_log_id
FROM (VALUES (log_data::json)) AS data (doc),
     json_array_elements(doc->'log') WITH ORDINALITY AS e(line, log_line),
     json_array_elements_text(e.line) WITH ORDINALITY AS items(log_value, log_value_index)
WHERE log_value_index > 1 -- don't include the timestamp value (it shouldn't be written as log_value)
  AND log_line > 1
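
Inside the function, this SELECT presumably feeds the INSERT directly, so the whole array is written in one statement instead of building s_insert_query_values row by row; a sketch (log_data, i_node_id_list and i_log_id are assumed to be variables of the surrounding PL/pgSQL function):

-- Sketch: the same SELECT as above, wrapped in the INSERT from the question.
INSERT INTO log_rlda
  (record_node_id, log_line, log_value, timestamp, record_log_id)
SELECT i_node_id_list[log_value_index],
       e.log_line - 1,
       items.log_value::double precision,
       to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS'),
       i_log_id
FROM (VALUES (log_data::json)) AS data (doc),
     json_array_elements(doc->'log') WITH ORDINALITY AS e(line, log_line),
     json_array_elements_text(e.line) WITH ORDINALITY AS items(log_value, log_value_index)
WHERE log_value_index > 1 -- skip the timestamp element
  AND log_line > 1;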

1 Answer


You need two levels of unnesting.

select e.log_line, items.log_value, e.line -> 0 as timestamp
from (
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
  ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json)
) as data (doc), 
  json_array_elements(doc->'log') with ordinality as e(line, log_line), 
  json_array_elements(e.line)   with ordinality as items(log_value, log_value_index)
where log_value_index > 1;

The first call to json_array_elements() extracts all array elements from the log attribute, and with ordinality numbers each row of that array (log_line). The second call then extracts each element of those line arrays; again, with ordinality gives the position within the array (log_value_index).

The above query returns this:

log_line | log_value | timestamp            
---------+-----------+----------------------
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.0"     | "2016-10-12T20:33:21"
       1 | "24.00"   | "2016-10-12T20:33:21"
       1 | "1.83"    | "2016-10-12T20:33:21"
       1 | "-0.00"   | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "-100.00" | "2016-10-12T20:33:21"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.0"     | "2016-10-12T20:33:23"
       2 | "24.00"   | "2016-10-12T20:33:23"
       2 | "1.52"    | "2016-10-12T20:33:23"
       2 | "-0.61"   | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "-100.00" | "2016-10-12T20:33:23"

The result of the above statement can then be used to insert the data directly, without looping over it. That should be a lot faster than doing a lot of individual inserts.
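
For example, a direct INSERT ... SELECT could take roughly this shape (a sketch: 999 and 386 are placeholder IDs, not values derived from the JSON, and the inner call is switched to json_array_elements_text() so the value can be cast):

-- Sketch only: placeholder IDs, literal JSON, text extraction for the cast.
insert into log_rlda (record_node_id, log_line, log_value, timestamp, record_log_id)
select 999, e.log_line, items.log_value::double precision, (e.line ->> 0)::timestamp, 386
from (
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
  ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json)
) as data (doc),
  json_array_elements(doc->'log') with ordinality as e(line, log_line),
  json_array_elements_text(e.line) with ordinality as items(log_value, log_value_index)
where log_value_index > 1;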

I'm unsure, however, how you can integrate the correct record_node_id or record_log_id into the above result.


2 Comments

Thank you for the quick answer. The data is unnested in a different manner: for each value in the line array, except for the first value, a row is inserted into log_rlda. Explanation of the columns in log_rlda: record_node_id refers to a record node, which describes the meaning of the value; log_line is the line number in the log data; log_value is each value; timestamp is the first value in the line array; record_log_id refers to the set of data. I answered your first question as to what s_insert_query_values contains.
To solve the issue of integrating the correct record_node_id, I edited the statement to select the right ID from an array of IDs I gathered beforehand, using log_value_index to index into it. It works well and I have gained a significant performance boost so far. I added the solution to the question.
