Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: postgresql-cfbot/postgresql
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: cf/5319~1
Choose a base ref
...
head repository: postgresql-cfbot/postgresql
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: cf/5319
Choose a head ref
  • 5 commits
  • 62 files changed
  • 3 contributors

Commits on Nov 22, 2025

  1. Add a view to read contents of shared buffer lookup table

    The view exposes the contents of the shared buffer lookup table for
    debugging, testing and investigation.
    
    This helped me in debugging issues where the buffer descriptor array and
    buffer lookup table were out of sync; either the buffer lookup table had
    a mapping page->buffer which wasn't present in the buffer descriptor
    array or a page in the buffer descriptor array didn't have corresponding
    entry in the buffer lookup table. pg_buffercache doesn't help with those
    kind of issues. Also doing that under the debugger in very painful.
    
    I intend to keep this patch while the rest of the code matures. If it is
    found useful as a debugging tool, we may consider make it committable
    and commit it.
    
    Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
    ashutosh-bapat authored and Commitfest Bot committed Nov 22, 2025
    Configuration menu
    Copy the full SHA
    81da8b3 View commit details
    Browse the repository at this point in the history
  2. Memory and address space management for buffer resizing

    This has three changes
    
    1. Allow to use multiple shared memory mappings
    ============================================
    
    Currently all the work with shared memory is done via a single anonymous
    memory mapping, which limits ways how the shared memory could be organized.
    
    Introduce possibility to allocate multiple shared memory mappings, where
    a single mapping is associated with a specified shared memory segment.
    A new shared memory API is introduced, extended with a segment as a new
    parameter. As a path of least resistance, the original API is kept in
    place, utilizing the main shared memory segment.
    
    Modifies pg_shmem_allocations to report shared memory segment as well.
    Adds pg_shmem_segments to report shared memory segment information.
    
    2. Address space reservation for shared memory
    ============================================
    
    Currently the shared memory layout is designed to pack everything tight
    together, leaving no space between mappings for resizing. Here is how it
    looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
    anonymous shared memory we talk about:
    
        00400000-00490000         /path/bin/postgres
        ...
        012d9000-0133e000         [heap]
        7f443a800000-7f470a800000 /dev/zero (deleted)
        7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
        7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
        ...
    
    Make the layout more dynamic via splitting every shared memory segment
    into two parts:
    
    * An anonymous file, which actually contains shared memory content.
      Such an anonymous file is created via memfd_create, it lives in
      memory, behaves like a regular file and semantically equivalent to an
      anonymous memory allocated via mmap with MAP_ANONYMOUS.
    
    * A reservation mapping, which size is much larger than required shared
      segment size. This mapping is created with flag MAP_NORESERVE (to not
      count the reserved space against memory limits). The anonymous file is
      mapped into this reservation mapping.
    
    If we have to change the address maps while resizing the shared buffer
    pool, it is needed to be done in Postmaster too, so that the new
    backends will inherit the resized address space from the Postmaster.
    However, Postmaster is not invovled in ProcSignalBarrier mechanism and
    we don't want it to spend time in things other than its core
    functionality. To achive that, maximum required address space maps are
    setup upfront with read and write access when starting the server. When
    resizing the buffer pool only the backing file object is resized from
    the coordinator. This also makes the ProcSignalBarrier handling code
    light for backends other than the coordinator.
    
    The resulting layout looks like this:
    
        00400000-00490000         /path/bin/postgres
        ...
        3f526000-3f590000 rw-p 		[heap]
        7fbd827fe000-7fbd8bdde000 rw-s 	/memfd:main (deleted) -- anon file
        7fbd8bdde000-7fbe82800000 ---s 	/memfd:main (deleted) -- reservation
        7fbe82800000-7fbe90670000 r--p 	/usr/lib/locale/locale-archive
        7fbe90800000-7fbe90941000 r-xp 	/usr/lib64/libstdc++.so.6.0.34
    
    To resize a shared memory segment in this layout it's possible to use
    ftruncate on the memory mapped file.
    
    This approach also do not impact the actual memory usage as reported by
    the kernel.
    
    TODO: Verify that Cgroup v2 doesn't have any problems with that as well. To verify a new cgroup
    was created with the memory limit 256 MB, then PostgreSQL was launched within
    this cgroup with shared_buffers = 128 MB:
    
        $ cd /sys/fs/cgroup
        $ mkdir postgres
        $ cd postres
        $ echo 268435456 > memory.max
    
        $ echo $MASTER_PID_SHELL > cgroup.procs
        # postgres from the master branch has being successfully launched
        #  from that shell
        $ cat memory.current
        17465344 (~16.6 MB)
        # stop postgres
    
        $ echo $PATCH_PID_SHELL > cgroup.procs
        # postgres from the patch has being successfully launched from that shell
        $ cat memory.current
        20770816 (~19.8 MB)
    
    There are also few unrelated advantages of using memory mapped files:
    
    * We've got a file descriptor, which could be used for regular file
      operations (modification, truncation, you name it).
    
    * The file could be given a name, which improves readability when it
      comes to process maps.
    
    * By default, Linux will not add file-backed shared mappings into a core dump,
      making it more convenient to work with them in PostgreSQL: no more huge dumps
      to process. - Some hackers have expressed concerns over it.
    
    The downside is that memfd_create is Linux specific.
    
    3. Refactor CalculateShmemSize()
    =============================
    
    This function calls many functions which return the amount of shared
    memory required for different shared memory data structures. Up until
    now, the returned total of these sizes was used to create a single
    shared memory segment. With this change, CalculateShmemSize() needs to
    estimate memory requirements for each of the segments. It now takes an
    array of MemoryMappingSizes, containing as many elements as the number
    of segments, as an argument. The sizes returned by all the function it
    calls, except BufferManagerShmemSize(), are added and saved in the first
    element (index 0) of the array.  BufferManagerShmemSize() is modified to
    save the amount of memory required for buffer manager related segments
    in the corresponding array element. Additionally it also saves the
    amount of reserved space. For now, the amount of reserved address space
    is same as the amount of required memory but that is expected to change
    with the next commit which implements buffer pool resize.
    CalculateShmemSize() now returns the total of sizes corresponding to all
    the sizes.
    
    Author: Dmitrii Dolgov and Ashutosh Bapat
    Reviewed-by: Tomas Vondra
    erthalion authored and Commitfest Bot committed Nov 22, 2025
    Configuration menu
    Copy the full SHA
    a12248f View commit details
    Browse the repository at this point in the history
  3. Allow to resize shared memory without restart

    shared_buffers is now PGC_SIGHUP instead of PGC_POSTMASTER.  The value
    of this GUC is saved in NBuffersPending instead of NBuffers. When the
    server starts, the shared memory size is estimated and the memory is
    allocated using NBuffersPending.
    
    When a server is running, the new value of GUC (set using ALTER SYSTEM
    ... SET shared_buffers = ...; followed by SELECT pg_reload_conf()) does
    not come into effect immediately. Instead a function
    pg_resize_shared_buffers() is used to resize the buffer pool. The
    function uses the current value of GUC in the backends where it is
    executed. The function also coordinates the buffer resizing
    synchronization across backends.
    
    SHOW shared_buffers now shows the current size of the shared buffer pool
    but it also shows pending size of shared buffers, if any.
    
    A new GUC max_shared_buffers is introduced to control the maximum value
    of shared_buffers that can be set. By default it is 0. When explicitly
    set it needs to be higher than 'shared_buffers'. When max_shared_buffers
    is set to 0, it assumes the same value as GUC shared_buffers. This GUC
    determines the size of address space reserved for future buffer pool
    sizes and the size of buffer look up table.
    
    TBD: Describe the protocol used by pg_resize_shared_buffers() to
    synchronize buffer resizing operation with other backends.
    
    When shrinking the shared buffers pool, each buffer in the area being
    shrunk needs to be flushed if it's dirty so as not to loose the changes
    to that buffer after shrinking. Also, each such buffer needs to be
    removed from the buffer mapping table so that backends do not access it
    after shrinking.
    
    If a buffer being evicted is pinned, we abort the resizing operation.
    There are other alternative which are not implemented in the current
    patches 1. to wait for the pinned buffer to get unpinned, 2. the backend
    is killed or it itself cancels the query  or 3.  rollback the operation.
    Note that option 1 and 2 would require the pinning related local and
    shared records to be accessed. But we need infrastructure to do either
    of this right now.
    
    So far the buffer pool metdata (NBuffers and the shared memory segment
    address space) is saved in process local heap memory since it's static
    for the life of a server. It is passed to a new backend through
    Postmaster. But with buffer pool being resized while the server running,
    we need Postmaster to update its buffer pool metadata as the resizing
    progresses and pass it to the new backend. This has few complications:
    
    1. Postmaster does not receive ProcSignalBarrier. So we need to signal
       it separately.
    
    2. Postmaster's local state is inherited by the new backend when
       fork()ed. But we need more complex implementation to pass it to an
       exec()ed backend.
    
    3. A new backend may receive the updated state from Postmaster and also
       the signal barrier which prompts the same update. Thus the proc
       signal barrier code needs to be idempotent; adding further complexity
       to it.
    
    4. This task takes away Postmaster resources from it's core
       functionality.
    
    This can be avoided by following two changes:
    1. The shared memory is resized only in a single backend without
       requiring any changes to the memory address space.
    2. Maintaining the buffer pool metadata in the shared memory instead of
       process local memory. This change may affect performance so verify
       that performance is not degraded.
    
    TODO: In case the backend executing pg_resize_shared_buffers() exits
    before the operation finishes, we need to make sure that the changes
    made to the shared memory while resizing are cleaned up properly.
    
    Removing the evicted buffers from buffer ring
    =============================================
    If the buffer pool has been shrunk, the buffers in the buffer ring may
    not be valid anymore. Modify GetBufferFromRing to check if the buffer is
    still valid before using it. This makes GetBufferFromRing() a bit more
    expensive because of additional boolean condition and masks any bug that
    introduces an invalid buffer into the ring. The alternative fix is more
    complex as explained below.
    
    The strategy object is created in CurrentMemoryContext and is not
    available in any global structure thus accessible when processing buffer
    resizing barriers. We may modify GetAccessStrategy() to register
    strategy in a global linked list and then arrange to deregister it once
    it's no more in use. Looking at the places which use
    GetAccessStrategy(), fixing all those may be some work.
    
    Author: Ashutosh Bapat
    Author: Dmitrii Dolgov
    Author of some tests: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
    Reviewed-by: Tomas Vondra
    
    More detailed note follow: Need to see which of those fit in the commit
    message and which should be removed.
    
    Reinitializing strategry control area
    =====================================
    The commit introduces a separate function StrategyReInitialize() instead
    of reusing StrategyInitialize() since some of the things that the second
    one does are not required in the first one. Here's list of what
    StrategyReInitialize() does and how does it differ from
    StrategyInitialize().
    
    1. StrategyControl pointer needn't be fetched again since it should not
       change. But added an Assert to make sure the pointer is valid.
    2. &StrategyControl->buffer_strategy_lock need not be initialized again.
    3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
       the context of NBuffers. Now that NBuffers itself has changed, those
       three do not make sense. Reset them as if the server has restarted
       again.
    
    Ability to delay resizing operation
    ===================================
    This commit introduces a flag delay_shmem_resize, which postgresql
    backends and workers can use to signal the coordinator to delay resizing
    operation. Background writer sets this flag when its scanning buffers.
    
    Background writer operation (needs a rethink)
    ===========================
    Background writer is blocked when the actual resizing is in progress. It
    stops a scan in progress when it sees that the resizing has begun or is
    about to begin. Once the buffer resizing is finished, before resuming
    the regular operation, bgwriter resets the information saved so far.
    This information is viewed in the context of NBuffers and hence does not
    make sense after resizing which chanegs NBuffers.
    
    Buffer lookup table
    ===================
    Right now there is no way to free shared memory. Even if we shrink the
    buffer lookup table when shrinking the buffer pool the unused hash table
    entries can not be freed. When we expand the buffer pool, more entries
    can be allocated but we can not resize the hash table directory without
    rehashing all the entries. Just allocating more entries will lead to
    more contention. Hence we setup the buffer lookup table considering the
    maximum possible size of the buffer pool which is MaxAvailableMemory
    only once at the beginning.  Shared buffer lookup table and
    StrategyControl are not resized even if the buffer pool is resized hence
    they are allocated in the main shared memory segment
    
    BgWriter refactoring
    ====================
    
    The way BgBufferSync is written today, it packs four functionalities:
    setting up the buffer sync state, performing the buffer sync, resetting
    the buffer sync state when bgwriter_lru_maxpages <= 0 and setting it up
    again after bgwriter_lru_maxpages > 0. That makes the code hard to read.
    It will be good to divide this function into 3/4 different functions
    each performing one functionality. Then pack all the state (the local
    variables from that function converted to static global) into a
    structure, which is passed to these functions. Once that happens
    BgBufferSyncReset() will call one of the functions to reset the state
    when buffer pool is resized.
    erthalion authored and Commitfest Bot committed Nov 22, 2025
    Configuration menu
    Copy the full SHA
    b7678e1 View commit details
    Browse the repository at this point in the history
  4. WIP test shared buffers resizing and checkpoint

    A new test triggers an injection point in the BufferSync() after it has
    collected buffers to flushed. Simultaneously it starts buffer shrinking. The
    expectation is that the checkpointer would crash accessing a buffer (descriptor)
    outside the new range of shared buffers. But that does not happen because of a
    bug in synchronization. The checkpointer does not reload configuration when
    checkpoint is going on. It does not load the new value of the configuration.
    When the resizing is triggered by the PM, checkpointer receives the proc signal
    barrier but it does not start it doesn't enter the barrier mechanism and doesn't
    alter its address maps or memory sizes. Hence the test does not crash. But of
    course it means that it won't consider the correct size of buffers next time it
    performs a checkpoint.
    
    The test was at least useful to detect this anomaly. Once we fix the
    synchronization issue we should see the crash and then fix the crash.
    
    Author: Ashutosh Bapat
    
    Notes to reviewers
    ------------------
    
    1. pg_buffercache used a query on pg_settings to fetch the value of the
       number of buffers. That doesn't work anymore because of change in the
       SHOW shared_buffers. Modified the test to convert the setting value
       to the number of shared buffers, save it in a variable and use the
       variable in queries which need the number of shared buffers. We could
       instead fix ShowGUCOption() to pass use_units flag to show_hook and
       let it output the number of shared buffers instead.  But that seems a
       larger change. There aren't other GUCs whose show_hook outputs their
       values with units. So this local fix might be better.
    ashutosh-bapat authored and Commitfest Bot committed Nov 22, 2025
    Configuration menu
    Copy the full SHA
    18d355b View commit details
    Browse the repository at this point in the history
  5. [CF 5319] Changing shared_buffers without restart

    This branch was automatically generated by a robot using patches from an
    email thread registered at:
    
    https://commitfest.postgresql.org/patch/5319
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J+eQ@mail.gmail.com
    Author(s): Dmitry Dolgov
    Commitfest Bot committed Nov 22, 2025
    Configuration menu
    Copy the full SHA
    c0e907c View commit details
    Browse the repository at this point in the history
Loading