
I have a piece of code that works very well in every way but one. It takes too long to execute. My suspicion is that the long execution time is caused by an overuse of JOINs (11 to be precise). Here is what I am trying to do:

SETUP I have 3 tables:

  1. v_na_measurement_status

    id   start_measurement                 stop_measurement
    0    2020-09-01 22:19:48.480668+00:00  2020-09-01 22:29:29.713952+00:00
    1    2020-09-01 22:17:10.392089+00:00  2020-09-01 22:18:08.139668+00:00
    2    2020-09-01 22:12:49.795404+00:00  2020-09-01 22:15:14.564597+00:00
    ...
  2. sensor_double_precision

    id      sensor_name            timestamp                        value_cal
    40548   "curved_mirror_steps"  "2020-09-01 22:29:23.526468+00"  432131
    40547   "na_average_enable"    "2020-09-01 22:29:23.410416+00"  1
    40546   "na_averages"          "2020-09-01 22:29:23.295404+00"  16
    40545   "na_power"             "2020-09-01 22:29:23.174255+00"  -5
    40544   "na_stop_freq"         "2020-09-01 22:29:23.05868+00"   18000000000
    40543   "na_start_freq"        "2020-09-01 22:29:22.944205+00"  15000000000
    ...
  3. sensor_double_precision_arr

    id     sensor_name              timestamp                        value_cal
    3831   "na_s11_iq_data_trace2"  "2020-09-01 22:29:27.456345+00"  [array with ~2000 points]
    3830   "na_s21_iq_data"         "2020-09-01 22:29:27.389617+00"  [array with ~2000 points]
    3829   "na_s11_iq_data_trace2"  "2020-09-01 22:29:20.543466+00"  [array with ~2000 points]
    3828   "na_s21_iq_data"         "2020-09-01 22:29:20.443416+00"  [array with ~2000 points]

GOAL Using these 3 tables I want to create a VIEW called v_na_log that looks like this:

    start_measurement                 stop_measurement                  curved_mirror_steps  na_averages  na_start_freq  na_stop_freq  na_s21_iq_data             na_s11_iq_data
    "2020-09-01 22:29:22.913366+00"   "2020-09-01 22:29:27.478287+00"   432131               16           15000000000    18000000000   [array with ~2000 points]  [array with ~2000 points]
...
...

Basically, I want to pivot sensor_double_precision and sensor_double_precision_arr so that the distinct values in the sensor_name column become columns of their own.
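For reference, the general shape of this kind of pivot with the tablefunc extension looks like the following. This is only a minimal sketch assuming tablefunc is installed and the four sensor names shown above; the full solution comes next:

```sql
-- Minimal sketch of a crosstab pivot (assumes the tablefunc extension
-- and the sensor names shown above; not the full view).
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
  -- the source query must return (row_id, category, value),
  -- ordered so each row's categories appear in a consistent order
  'SELECT id, sensor_name, value_cal
     FROM sensor_double_precision
    ORDER BY 1, 2'
) AS pivoted(id bigint,
             curved_mirror_steps double precision,
             na_averages double precision,
             na_start_freq double precision,
             na_stop_freq double precision);
```

Note that the one-argument form of crosstab() fills the output columns left to right in the order the categories appear, which is why the source query must be ordered.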

ONE SOLUTION This is the long code I use to achieve this:

DROP VIEW IF EXISTS v_na_log;

CREATE VIEW v_na_log AS
SELECT
  final_view.id, vnms.start_measurement,
  vnms.stop_measurement, final_view.curved_mirror_steps,
  final_view.na_averages, final_view.na_start_freq,
  final_view.na_stop_freq, iq_data.freq_resolution,
  iq_data.na_s21_iq_data, iq_data.na_s11_iq_data
FROM crosstab(
  'WITH sorted_data AS (
     WITH current_data AS (
       SELECT DISTINCT ON (id, sensor_name)
              id, sensor_name, value_cal, timestamp
         FROM sensor_double_precision
        WHERE sensor_name IN (''na_start_freq'', ''na_stop_freq'',
                              ''na_averages'', ''curved_mirror_steps'')
        ORDER BY id, sensor_name, timestamp ASC)
     SELECT m.id, s.sensor_name, s.value_cal
       FROM v_na_measurement_status m
      INNER JOIN current_data s
         ON s.timestamp BETWEEN m.start_measurement AND m.stop_measurement),
   log_ids      AS (SELECT DISTINCT id FROM sorted_data),
   sensor_names AS (SELECT DISTINCT sensor_name FROM sorted_data)
   SELECT log_ids.id, sensor_names.sensor_name, sorted_data.value_cal
     FROM log_ids CROSS JOIN sensor_names
     LEFT JOIN sorted_data
       ON (log_ids.id = sorted_data.id AND
           sensor_names.sensor_name = sorted_data.sensor_name)')
  final_view(id bigint, curved_mirror_steps double precision,
             na_averages double precision, na_start_freq double precision,
             na_stop_freq double precision)
LEFT JOIN v_na_measurement_status vnms ON vnms.id = final_view.id
LEFT JOIN
  (SELECT final_view.id, vnms.start_measurement, vnms.stop_measurement,
          array_length(final_view.na_s21_iq_data, 1) AS freq_resolution,
          final_view.na_s11_iq_data, final_view.na_s21_iq_data
     FROM crosstab(
       'WITH sorted_data AS (
          WITH current_data AS (
            SELECT DISTINCT ON (id, sensor_name)
                   id, sensor_name, value_cal, timestamp
              FROM sensor_double_precision_arr
             WHERE sensor_name LIKE ''%na_s21_iq_data%''
                OR sensor_name LIKE ''%na_s11_iq_data%''
             ORDER BY id, sensor_name, timestamp ASC)
          SELECT m.id, s.sensor_name, s.value_cal
            FROM v_na_measurement_status m
           INNER JOIN current_data s
              ON s.timestamp BETWEEN m.start_measurement AND m.stop_measurement),
        log_ids      AS (SELECT DISTINCT id FROM sorted_data),
        sensor_names AS (SELECT DISTINCT sensor_name FROM sorted_data)
        SELECT log_ids.id, sensor_names.sensor_name, sorted_data.value_cal
          FROM log_ids CROSS JOIN sensor_names
          LEFT JOIN sorted_data
            ON (log_ids.id = sorted_data.id AND
                sensor_names.sensor_name = sorted_data.sensor_name)')
       final_view(id bigint, na_s11_iq_data double precision[],
                  na_s21_iq_data double precision[])
     LEFT JOIN v_na_measurement_status vnms ON vnms.id = final_view.id) iq_data
  ON iq_data.id = final_view.id
ORDER BY vnms.start_measurement ASC;

PROBLEM The code above works correctly, and so far I haven't run into issues even on edge cases, but it takes an annoyingly long time to run. For example, SELECT * FROM v_na_log LIMIT 10 takes about 13 seconds to execute, and even longer without the LIMIT clause. I usually deal with far more than 10 rows, so the slower it runs, the harder it is for me to analyze the data. As I said, I suspect this has to do with the use of so many JOINs, but I do not see a better way of doing it. I want to know if there is a better solution, because I suspect this problem will only get worse as the tables grow.
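Before restructuring anything, it may be worth pinning down where the time actually goes; standard Postgres tooling is enough for that (a diagnostic sketch, not part of the view itself):

```sql
-- Show the actual plan, per-node timings, and buffer usage
-- for the slow query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM v_na_log LIMIT 10;
```

A sequential scan or nested loop over the full sensor tables in this output would point at the range join rather than the crosstab itself.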

I have posted about simplified versions of similar tasks before and received invaluable suggestions, which I have incorporated into my code. Here is a related question I asked: Crosstab using data from 2 different tables

Any suggestions on a better organization of the tables in the database itself are welcome as well.

  • Are the crosstab sensor_name columns a handful of values you know in advance? If not, why do you need such a wide table of potentially hundreds or thousands of columns? Are you analyzing at the application layer (i.e., Python, R)? Commented Sep 2, 2020 at 21:22
  • Yes, they are a handful of sensor_names that I know in advance. There are two purposes of this VIEW. The main reason is that I want to create plots in a web app called Grafana and this VIEW will make that possible. The secondary reason (not as important) is to use fewer calls to the DB using python while doing data analysis Commented Sep 2, 2020 at 21:57
  • Impossible to answer. Try out the parts that make up the view and see where the slowdown happens. Commented Sep 3, 2020 at 3:20
  • Hi, I have actually tried that out; I probably should have mentioned it, sorry! Each individual part is pretty slow: the first crosstab takes 2.5 s and the second takes 9-10 s (both with a LIMIT 10 clause). I am not sure why the second crosstab takes so much longer given it's doing the exact same thing on a different table; my guess is that the large arrays have something to do with it, but I do not know why. Commented Sep 3, 2020 at 4:00

1 Answer

Based on the posted data and desired output, and given that the pivoted values are known in advance, consider conditional aggregation using Postgres' selective aggregates with FILTER():

SELECT m.start_measurement
       , m.stop_measurement
       , MAX(p.value_cal) FILTER(WHERE p.sensor_name = 'curved_mirror_steps') AS curved_mirror_steps
       , ...
       , MAX(pa.value_cal) FILTER(WHERE pa.sensor_name = 'na_s11_iq_data') AS na_s11_iq_data
      
FROM v_na_measurement_status m 
LEFT JOIN sensor_double_precision p 
  ON p.timestamp BETWEEN m.start_measurement 
                     AND m.stop_measurement
LEFT JOIN sensor_double_precision_arr pa
  ON pa.timestamp BETWEEN m.start_measurement 
                     AND m.stop_measurement
GROUP BY m.start_measurement
       , m.stop_measurement
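Since both joins filter on a timestamp range, plain B-tree indexes on the timestamp columns may also help either version of the query. These DDL statements are an assumption on my part, not something stated in the answer:

```sql
-- Hypothetical indexes so the range join can seek on timestamp
-- instead of scanning both sensor tables; a composite
-- (sensor_name, timestamp) index is an alternative if the
-- sensor_name filters are selective.
CREATE INDEX IF NOT EXISTS idx_sdp_timestamp
    ON sensor_double_precision (timestamp);
CREATE INDEX IF NOT EXISTS idx_sdpa_timestamp
    ON sensor_double_precision_arr (timestamp);
```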

2 Comments

Hey thanks! Although this didn't solve the issue of the queries taking a long time to run, it is definitely a lot cleaner than my code. For some reason it takes slightly longer to execute.
Interesting. I suspect it is the timestamp range JOIN matches, and maybe the array values with ~2k points in each row. What happens if you drop the array table? Or use INNER JOIN? Play around and see. Also, review the EXPLAIN plans, and consider researching timestamp indexing further, e.g., in this DBA post.
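One option the thread does not discuss (so purely a suggestion on my part): if the underlying tables only change when a new measurement finishes, the pivot can be computed once and refreshed on demand instead of on every SELECT:

```sql
-- Hypothetical materialized variant of the view; reads become a
-- plain table scan, at the cost of an explicit refresh whenever
-- new measurement data arrives.
CREATE MATERIALIZED VIEW mv_na_log AS
SELECT * FROM v_na_log;

-- run after inserting a new measurement:
REFRESH MATERIALIZED VIEW mv_na_log;
```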
