Memory consumption of parallel processing in python when access a database

Question

I have a fairly huge table (180 million records) in SQL Server database. Something like below:

my_table>> columns: Date, Value1, Value2, Valeu3

I also have a python script which run concurrently with pool.map() and in each child process(iteration), a connection is made to access my_table and fetch a slice of it with below script and do other computations:

select * from my_table where Date is between a1 and a2

My question is when the python script run in parallel, does each child process load the whole SQL table data (180 million rows) in memory and then slice it based on where condition?

If that is the case, each child process would have to load 180 million rows into memory and that would freeze everything.

I am pretty sure that if I query a huge table in SQL Server a couple of times, the whole data would load into memory by SQL Server just once for the first query and other queries would use the data which had loaded into RAM by the first query.

David Browne - Microsoft · Accepted Answer · 2020-04-26 17:32:45Z

In SQL Server queries always read data from the in-memory page cache. If a query plan needs rows on a page not currently in the page cache, the buffer manager puts the query into a PAGEIOLATCH wait and fetches the page into memory.

If multiple processes send a query like

select * from my_table where Date is between a1 and a2

Each query may need to read all the rows to apply the filter (that depends on the indexes), but they will all read the same pages from memory, to the extent that the table fits in memory.

You can massively increase how much of the table fits in memory by storing it with page compression (~3x compression), or as a clustered columnstore (~10x compression).

And you can estimate the compression with sp_estimate_data_compression_savings.

Note that all compression styles improve server-side query processing, but also increase the cost of moving rows from the server to the client, as query plans can read the compressed data, but it must be uncompressed to send over the network. So if you're fetching it all to the client, it may not be worthwhile.

Also SQL Server 2017 and later have an optional component SQL Server Machine Learning Services that allows you to run your Python code on the server, with super-fast access to the data.

Collectives™ on Stack Overflow

Memory consumption of parallel processing in python when access a database

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related