Comparing changes

The view exposes the contents of the shared buffer lookup table for debugging, testing and investigation. This helped me in debugging issues where the buffer descriptor array and buffer lookup table were out of sync; either the buffer lookup table had a mapping page->buffer which wasn't present in the buffer descriptor array or a page in the buffer descriptor array didn't have corresponding entry in the buffer lookup table. pg_buffercache doesn't help with those kind of issues. Also doing that under the debugger in very painful. I intend to keep this patch while the rest of the code matures. If it is found useful as a debugging tool, we may consider make it committable and commit it. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>

This has three changes 1. Allow to use multiple shared memory mappings ============================================ Currently all the work with shared memory is done via a single anonymous memory mapping, which limits ways how the shared memory could be organized. Introduce possibility to allocate multiple shared memory mappings, where a single mapping is associated with a specified shared memory segment. A new shared memory API is introduced, extended with a segment as a new parameter. As a path of least resistance, the original API is kept in place, utilizing the main shared memory segment. Modifies pg_shmem_allocations to report shared memory segment as well. Adds pg_shmem_segments to report shared memory segment information. 2. Address space reservation for shared memory ============================================ Currently the shared memory layout is designed to pack everything tight together, leaving no space between mappings for resizing. Here is how it looks like for one mapping in /proc/$PID/maps, /dev/zero represents the anonymous shared memory we talk about: 00400000-00490000 /path/bin/postgres ... 012d9000-0133e000 [heap] 7f443a800000-7f470a800000 /dev/zero (deleted) 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive 7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34 ... Make the layout more dynamic via splitting every shared memory segment into two parts: * An anonymous file, which actually contains shared memory content. Such an anonymous file is created via memfd_create, it lives in memory, behaves like a regular file and semantically equivalent to an anonymous memory allocated via mmap with MAP_ANONYMOUS. * A reservation mapping, which size is much larger than required shared segment size. This mapping is created with flag MAP_NORESERVE (to not count the reserved space against memory limits). The anonymous file is mapped into this reservation mapping. If we have to change the address maps while resizing the shared buffer pool, it is needed to be done in Postmaster too, so that the new backends will inherit the resized address space from the Postmaster. However, Postmaster is not invovled in ProcSignalBarrier mechanism and we don't want it to spend time in things other than its core functionality. To achive that, maximum required address space maps are setup upfront with read and write access when starting the server. When resizing the buffer pool only the backing file object is resized from the coordinator. This also makes the ProcSignalBarrier handling code light for backends other than the coordinator. The resulting layout looks like this: 00400000-00490000 /path/bin/postgres ... 3f526000-3f590000 rw-p [heap] 7fbd827fe000-7fbd8bdde000 rw-s /memfd:main (deleted) -- anon file 7fbd8bdde000-7fbe82800000 ---s /memfd:main (deleted) -- reservation 7fbe82800000-7fbe90670000 r--p /usr/lib/locale/locale-archive 7fbe90800000-7fbe90941000 r-xp /usr/lib64/libstdc++.so.6.0.34 To resize a shared memory segment in this layout it's possible to use ftruncate on the memory mapped file. This approach also do not impact the actual memory usage as reported by the kernel. TODO: Verify that Cgroup v2 doesn't have any problems with that as well. To verify a new cgroup was created with the memory limit 256 MB, then PostgreSQL was launched within this cgroup with shared_buffers = 128 MB: $ cd /sys/fs/cgroup $ mkdir postgres $ cd postres $ echo 268435456 > memory.max $ echo $MASTER_PID_SHELL > cgroup.procs # postgres from the master branch has being successfully launched # from that shell $ cat memory.current 17465344 (~16.6 MB) # stop postgres $ echo $PATCH_PID_SHELL > cgroup.procs # postgres from the patch has being successfully launched from that shell $ cat memory.current 20770816 (~19.8 MB) There are also few unrelated advantages of using memory mapped files: * We've got a file descriptor, which could be used for regular file operations (modification, truncation, you name it). * The file could be given a name, which improves readability when it comes to process maps. * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process. - Some hackers have expressed concerns over it. The downside is that memfd_create is Linux specific. 3. Refactor CalculateShmemSize() ============================= This function calls many functions which return the amount of shared memory required for different shared memory data structures. Up until now, the returned total of these sizes was used to create a single shared memory segment. With this change, CalculateShmemSize() needs to estimate memory requirements for each of the segments. It now takes an array of MemoryMappingSizes, containing as many elements as the number of segments, as an argument. The sizes returned by all the function it calls, except BufferManagerShmemSize(), are added and saved in the first element (index 0) of the array. BufferManagerShmemSize() is modified to save the amount of memory required for buffer manager related segments in the corresponding array element. Additionally it also saves the amount of reserved space. For now, the amount of reserved address space is same as the amount of required memory but that is expected to change with the next commit which implements buffer pool resize. CalculateShmemSize() now returns the total of sizes corresponding to all the sizes. Author: Dmitrii Dolgov and Ashutosh Bapat Reviewed-by: Tomas Vondra

shared_buffers is now PGC_SIGHUP instead of PGC_POSTMASTER. The value of this GUC is saved in NBuffersPending instead of NBuffers. When the server starts, the shared memory size is estimated and the memory is allocated using NBuffersPending. When a server is running, the new value of GUC (set using ALTER SYSTEM ... SET shared_buffers = ...; followed by SELECT pg_reload_conf()) does not come into effect immediately. Instead a function pg_resize_shared_buffers() is used to resize the buffer pool. The function uses the current value of GUC in the backends where it is executed. The function also coordinates the buffer resizing synchronization across backends. SHOW shared_buffers now shows the current size of the shared buffer pool but it also shows pending size of shared buffers, if any. A new GUC max_shared_buffers is introduced to control the maximum value of shared_buffers that can be set. By default it is 0. When explicitly set it needs to be higher than 'shared_buffers'. When max_shared_buffers is set to 0, it assumes the same value as GUC shared_buffers. This GUC determines the size of address space reserved for future buffer pool sizes and the size of buffer look up table. TBD: Describe the protocol used by pg_resize_shared_buffers() to synchronize buffer resizing operation with other backends. When shrinking the shared buffers pool, each buffer in the area being shrunk needs to be flushed if it's dirty so as not to loose the changes to that buffer after shrinking. Also, each such buffer needs to be removed from the buffer mapping table so that backends do not access it after shrinking. If a buffer being evicted is pinned, we abort the resizing operation. There are other alternative which are not implemented in the current patches 1. to wait for the pinned buffer to get unpinned, 2. the backend is killed or it itself cancels the query or 3. rollback the operation. Note that option 1 and 2 would require the pinning related local and shared records to be accessed. But we need infrastructure to do either of this right now. So far the buffer pool metdata (NBuffers and the shared memory segment address space) is saved in process local heap memory since it's static for the life of a server. It is passed to a new backend through Postmaster. But with buffer pool being resized while the server running, we need Postmaster to update its buffer pool metadata as the resizing progresses and pass it to the new backend. This has few complications: 1. Postmaster does not receive ProcSignalBarrier. So we need to signal it separately. 2. Postmaster's local state is inherited by the new backend when fork()ed. But we need more complex implementation to pass it to an exec()ed backend. 3. A new backend may receive the updated state from Postmaster and also the signal barrier which prompts the same update. Thus the proc signal barrier code needs to be idempotent; adding further complexity to it. 4. This task takes away Postmaster resources from it's core functionality. This can be avoided by following two changes: 1. The shared memory is resized only in a single backend without requiring any changes to the memory address space. 2. Maintaining the buffer pool metadata in the shared memory instead of process local memory. This change may affect performance so verify that performance is not degraded. TODO: In case the backend executing pg_resize_shared_buffers() exits before the operation finishes, we need to make sure that the changes made to the shared memory while resizing are cleaned up properly. Removing the evicted buffers from buffer ring ============================================= If the buffer pool has been shrunk, the buffers in the buffer ring may not be valid anymore. Modify GetBufferFromRing to check if the buffer is still valid before using it. This makes GetBufferFromRing() a bit more expensive because of additional boolean condition and masks any bug that introduces an invalid buffer into the ring. The alternative fix is more complex as explained below. The strategy object is created in CurrentMemoryContext and is not available in any global structure thus accessible when processing buffer resizing barriers. We may modify GetAccessStrategy() to register strategy in a global linked list and then arrange to deregister it once it's no more in use. Looking at the places which use GetAccessStrategy(), fixing all those may be some work. Author: Ashutosh Bapat Author: Dmitrii Dolgov Author of some tests: Palak Chaturvedi <chaturvedipalak1911@gmail.com> Reviewed-by: Tomas Vondra More detailed note follow: Need to see which of those fit in the commit message and which should be removed. Reinitializing strategry control area ===================================== The commit introduces a separate function StrategyReInitialize() instead of reusing StrategyInitialize() since some of the things that the second one does are not required in the first one. Here's list of what StrategyReInitialize() does and how does it differ from StrategyInitialize(). 1. StrategyControl pointer needn't be fetched again since it should not change. But added an Assert to make sure the pointer is valid. 2. &StrategyControl->buffer_strategy_lock need not be initialized again. 3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in the context of NBuffers. Now that NBuffers itself has changed, those three do not make sense. Reset them as if the server has restarted again. Ability to delay resizing operation =================================== This commit introduces a flag delay_shmem_resize, which postgresql backends and workers can use to signal the coordinator to delay resizing operation. Background writer sets this flag when its scanning buffers. Background writer operation (needs a rethink) =========================== Background writer is blocked when the actual resizing is in progress. It stops a scan in progress when it sees that the resizing has begun or is about to begin. Once the buffer resizing is finished, before resuming the regular operation, bgwriter resets the information saved so far. This information is viewed in the context of NBuffers and hence does not make sense after resizing which chanegs NBuffers. Buffer lookup table =================== Right now there is no way to free shared memory. Even if we shrink the buffer lookup table when shrinking the buffer pool the unused hash table entries can not be freed. When we expand the buffer pool, more entries can be allocated but we can not resize the hash table directory without rehashing all the entries. Just allocating more entries will lead to more contention. Hence we setup the buffer lookup table considering the maximum possible size of the buffer pool which is MaxAvailableMemory only once at the beginning. Shared buffer lookup table and StrategyControl are not resized even if the buffer pool is resized hence they are allocated in the main shared memory segment BgWriter refactoring ==================== The way BgBufferSync is written today, it packs four functionalities: setting up the buffer sync state, performing the buffer sync, resetting the buffer sync state when bgwriter_lru_maxpages <= 0 and setting it up again after bgwriter_lru_maxpages > 0. That makes the code hard to read. It will be good to divide this function into 3/4 different functions each performing one functionality. Then pack all the state (the local variables from that function converted to static global) into a structure, which is passed to these functions. Once that happens BgBufferSyncReset() will call one of the functions to reset the state when buffer pool is resized.

A new test triggers an injection point in the BufferSync() after it has collected buffers to flushed. Simultaneously it starts buffer shrinking. The expectation is that the checkpointer would crash accessing a buffer (descriptor) outside the new range of shared buffers. But that does not happen because of a bug in synchronization. The checkpointer does not reload configuration when checkpoint is going on. It does not load the new value of the configuration. When the resizing is triggered by the PM, checkpointer receives the proc signal barrier but it does not start it doesn't enter the barrier mechanism and doesn't alter its address maps or memory sizes. Hence the test does not crash. But of course it means that it won't consider the correct size of buffers next time it performs a checkpoint. The test was at least useful to detect this anomaly. Once we fix the synchronization issue we should see the crash and then fix the crash. Author: Ashutosh Bapat Notes to reviewers ------------------ 1. pg_buffercache used a query on pg_settings to fetch the value of the number of buffers. That doesn't work anymore because of change in the SHOW shared_buffers. Modified the test to convert the setting value to the number of shared buffers, save it in a variable and use the variable in queries which need the number of shared buffers. We could instead fix ShowGUCOption() to pass use_units flag to show_hook and let it output the number of shared buffers instead. But that seems a larger change. There aren't other GUCs whose show_hook outputs their values with units. So this local fix might be better.

This branch was automatically generated by a robot using patches from an email thread registered at: https://commitfest.postgresql.org/patch/5319 The branch will be overwritten each time a new patch version is posted to the thread, and also periodically to check for bitrot caused by changes on the master branch. Patch(es): https://www.postgresql.org/message-id/CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J+eQ@mail.gmail.com Author(s): Dmitry Dolgov

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comparing changes

Open a pull request

Commits on Nov 22, 2025

This comparison is taking too long to generate.

Uh oh!