I have a table partitioned by day. It holds one month of data, about 3 billion rows, but many of its partitions are empty.

How can I efficiently select exactly 5000 random rows from across the entire table?

I'm considering TABLESAMPLE SYSTEM (0.1) with a LIMIT, but when empty partitions end up in the sample, it returns fewer than 5000 rows. The query select *, row_number() over (order by random()) takes a long time to run and puts a heavy load on the CPU.

  • You can consider several strategies: 1) Use an improved TABLESAMPLE SYSTEM method within a loop to accumulate the desired number of records. 2) Sample a few rows from each non-empty partition and union the results, ensuring a more uniform distribution. 3) Employ a random offset method, selecting rows based on a list of unique random numbers, though this can be CPU-intensive. 4) Combine TABLESAMPLE with a random offset approach to first narrow down the dataset and then apply randomness, potentially offering a balance between efficiency and randomness. Commented Jan 23, 2024 at 21:37
  • Try other TABLESAMPLE methods, e.g. BERNOULLI, or SYSTEM_ROWS from the tsm_system_rows extension. Commented Jan 24, 2024 at 8:21
  • It depends on what exactly you call random records. There is an easy way to get exactly N random records quickly, as long as it is acceptable for those records to be stored consecutively in the table (or rather, in segments of the table). Does that count as random to you? I am asking because the records will not look randomly chosen if, for example, they are inserted into your table with CURRENT_TIMESTAMP or a SERIAL in one of the columns. If you want random records from random segments, I am afraid it will inherently be slow, because it means reading the whole table from disk. Commented Jan 30, 2024 at 15:07
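
Strategy 2 from the first comment (sample a few rows from each non-empty partition and union the results) could be sketched roughly as below. This is only an illustration in Python that builds the SQL text, not a tested solution for the table in the question. The partition names and row counts are hypothetical; the counts are assumed to come from something like the pg_class.reltuples estimates, and SYSTEM_ROWS assumes the tsm_system_rows extension is installed. The requested total is split across non-empty partitions in proportion to their row counts, so empty partitions never enter the query.

```python
def allocate_sample(partition_counts, total):
    """Split `total` rows across partitions in proportion to their row counts.

    Empty partitions are skipped. Uses the largest-remainder method with
    integer arithmetic so the allocations always sum to exactly `total`.
    """
    non_empty = {name: n for name, n in partition_counts.items() if n > 0}
    available = sum(non_empty.values())
    if total > available:
        raise ValueError("the table holds fewer rows than requested")

    # Integer part of each partition's proportional share.
    alloc = {name: total * n // available for name, n in non_empty.items()}
    # Hand the rows lost to rounding to the largest fractional remainders.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(
        non_empty,
        key=lambda name: (total * non_empty[name]) % available,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def build_query(alloc):
    """Build one UNION ALL query that samples each partition directly."""
    parts = [
        f"SELECT * FROM {name} TABLESAMPLE SYSTEM_ROWS({k})"
        for name, k in alloc.items()
        if k > 0
    ]
    return "\nUNION ALL\n".join(parts)


if __name__ == "__main__":
    # Hypothetical daily partitions; the second one is empty.
    counts = {
        "events_2024_01_01": 1000,
        "events_2024_01_02": 0,
        "events_2024_01_03": 3000,
    }
    print(build_query(allocate_sample(counts, 100)))
```

Two caveats apply. SYSTEM_ROWS samples at block level, so the rows taken from each partition tend to sit next to each other on disk, which is the limitation the last comment describes. And if the row counts are only estimates, a partition may hold fewer rows than its share, in which case the result can fall short of the requested total.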
