summaryrefslogtreecommitdiff
path: root/src/backend/storage/ipc
AgeCommit message (Collapse)Author
45 hoursAllow passing a pointer to GetNamedDSMSegment()'s init callback.Nathan Bossart
This commit adds a new "void *arg" parameter to GetNamedDSMSegment() that is passed to the initialization callback function. This is useful for reusing an initialization callback function for multiple DSM segments. Author: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com
5 daysReplace most StaticAssertStmt() with StaticAssertDecl()Peter Eisentraut
Similar to commit 75f49221c22, it is preferable to use StaticAssertDecl() instead of StaticAssertStmt() when possible. Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com
8 daysUse palloc_object() and palloc_array() in backend codeMichael Paquier
The idea is to encourage more the use of these new routines across the tree, as these offer stronger type safety guarantees than palloc(). This batch of changes includes most of the trivial changes suggested by the author for src/backend/. A total of 334 files are updated here. Among these files, 48 of them have their build change slightly; these are caused by line number changes as the new allocation formulas are simpler, shaving around 100 lines of code in total. Similar work has been done in 0c3c5c3b06a3 and 31d3847a37be. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
14 daysRename BUFFERPIN wait event class to BUFFERAndres Freund
In an upcoming patch more wait events will be added to the wait event class (for buffer locking), making the current name too specific. Alternatively we could introduce a dedicated wait event class for those, but it seems somewhat confusing to have a BUFFERPIN and a BUFFER wait event class. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-02Show size of DSAs and dshashes in pg_dsm_registry_allocations.Nathan Bossart
Presently, this view reports NULL for the size of DSAs and dshash tables because 1) the current backend might not be attached to them and 2) the registry doesn't save the pointers to the dsa_area or dshash_table in local memory. Also, the view doesn't show partially-initialized entries to avoid ambiguity, since those entries would report a NULL size as well. This commit introduces a function that looks up the size of a DSA given its handle (transiently attaching to the control segment if needed) and teaches pg_dsm_registry_allocations to use it to show the size of successfully-initialized DSA and dshash entries. Furthermore, the view now reports partially-initialized entries with a NULL size. Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aSeEDeznAsHR1_YF%40nathan
2025-12-02Remove useless casting to same typePeter Eisentraut
This removes some casts where the input already has the same type as the type specified by the cast. Their presence could cause risks of hiding actual type mismatches in the future or silently discarding qualifiers. It also improves readability. Same kind of idea as 7f798aca1d5 and ef8fe693606. (This does not change all such instances, but only those hand-picked by the author.) Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
2025-11-26Teach DSM registry to retry entry initialization if needed.Nathan Bossart
If DSM registry entry initialization fails, backends could try to use an uninitialized DSM segment, DSA, or dshash table (since the entry is still added to the registry). To fix, restructure the code so that the registry retries initialization as needed. This commit also modifies pg_get_dsm_registry_allocations() to leave out partially-initialized entries, as they shouldn't have any allocated memory. DSM registry entry initialization shouldn't fail often in practice, but retrying was deemed better than leaving entries in a permanently failed state (as was done by commit 1165a933aa, which has since been reverted). Suggested-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org Backpatch-through: 17
2025-11-26Revert "Teach DSM registry to ERROR if attaching to an uninitialized entry."Nathan Bossart
This reverts commit 1165a933aa (and the corresponding commits on the back-branches). In a follow-up commit, we'll teach the registry to retry entry initialization instead of leaving it in a permanently failed state. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org Backpatch-through: 17
2025-11-24postgres: Use pg_{add,mul}_size_overflow()Jacob Champion
The backend implementations of add_size() and mul_size() can now make use of the APIs provided in common/int.h. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOYmi%2B%3D%2BpqUd2MUitvgW1pAJuXgG_TKCVc3_Ek7pe8z9nkf%2BAg%40mail.gmail.com
2025-11-12Teach DSM registry to ERROR if attaching to an uninitialized entry.Nathan Bossart
If DSM entry initialization fails, backends could try to use an uninitialized DSM segment, DSA, or dshash table (since the entry is still added to the registry). To fix, keep track of whether initialization completed, and ERROR if a backend tries to attach to an uninitialized entry. We could instead retry initialization as needed, but that seemed complicated, error prone, and unlikely to help most cases. Furthermore, such problems probably indicate a coding error. Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com Backpatch-through: 17
2025-11-06Refactor shared memory allocation for semaphoresHeikki Linnakangas
Before commit e25626677f, spinlocks were implemented using semaphores on some platforms (--disable-spinlocks). That made it necessary to initialize semaphores early, before any spinlocks could be used. Now that we don't support --disable-spinlocks anymore, we can allocate the shared memory needed for semaphores the same way as other shared memory structures. Since the semaphores are used only in the PGPROC array, move the semaphore shmem size estimation and initialization calls to ProcGlobalShmemSize() and InitProcGlobal(). Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
2025-11-06Add comment to explain why PGReserveSemaphores() is called earlyHeikki Linnakangas
Before commit e25626677f, PGReserveSemaphores() had to be called before SpinlockSemaInit() because spinlocks were implemented using semaphores on some platforms (--disable-spinlocks). Add a comment explaining that. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com Backpatch-to: 18
2025-11-05Add infrastructure for efficient LSN waitingAlexander Korotkov
Implement a new facility that allows processes to wait for WAL to reach specific LSNs, both on primary (waiting for flush) and standby (waiting for replay) servers. The implementation uses shared memory with per-backend information organized into pairing heaps, allowing O(1) access to the minimum waited LSN. This enables fast-path checks: after replaying or flushing WAL, the startup process or WAL writer can quickly determine if any waiters need to be awakened. Key components: - New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush() - Separate pairing heaps for replay and flush waiters - WaitLSN lightweight lock for coordinating shared state - Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring This infrastructure can be used by features that need to wait for WAL operations to complete. Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-10-27Fix a couple of comments.Nathan Bossart
These were discovered while reviewing Aleksander Alekseev's proposed changes to pgindent. Oversights in commits 393e0d2314 and 25a30bbd42. Discussion: https://postgr.es/m/aP-H6kSsGOxaB21k%40nathan
2025-10-21Fix comment in pg_get_shmem_allocations_numa()Michael Paquier
The comment fixed in this commit described the function as dealing with database blocks, but in reality it processes shared memory allocations. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aH4DDhdiG9Gi0rG7@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 18
2025-10-03Make some use of anonymous unions [DSM registry].Nathan Bossart
Make some use of anonymous unions, which are allowed as of C11, as examples and encouragement for future code, and to test compilers. This commit changes the DSMRegistryEntry struct. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/aNKsDg0fJwqhZdXX%40nathan
2025-09-25Don't include execnodes.h in replication/conflict.hÁlvaro Herrera
... which silently propagates a lot of headers into many places via pgstat.h, as evidenced by the variety of headers that this patch needs to add to seemingly random places. Add a minimum of typedefs to conflict.h to be able to remove execnodes.h, and fix the fallout. Backpatch to 18, where conflict.h first appeared. Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql
2025-09-24Remove PointerIsValid()Peter Eisentraut
This doesn't provide any value over the standard style of checking the pointer directly or comparing against NULL. Also remove related: - AllocPointerIsValid() [unused] - IndexScanIsValid() [had one user] - HeapScanIsValid() [unused] - InvalidRelation [unused] Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(), RelationIsValid for now, to reduce code churn. Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi
2025-09-03Move dynamically-allocated LWLock tranche names to shared memory.Nathan Bossart
There are two ways for shared libraries to allocate their own LWLock tranches. One way is to call RequestNamedLWLockTranche() in a shmem_request_hook, which requires the library to be loaded via shared_preload_libraries. The other way is to call LWLockNewTrancheId(), which is not subject to the same restrictions. However, LWLockNewTrancheId() does require each backend to store the tranche's name in backend-local memory via LWLockRegisterTranche(). This API is a little cumbersome and leads to things like unhelpful pg_stat_activity.wait_event values in backends that haven't loaded the library. This commit moves these LWLock tranche names to shared memory, thus eliminating the need for each backend to call LWLockRegisterTranche(). Instead, the tranche name must be provided to LWLockNewTrancheId(), which immediately makes the name available to all backends. Since the tranche name array is append-only, lookups can ordinarily avoid locking as long as their local copy of the LWLock counter is greater than the requested tranche ID. One downside of this approach is that we now have a hard limit on both the length of tranche names (NAMEDATALEN-1 bytes) and the number of dynamically-allocated tranches (256). Besides a limit of NAMEDATALEN-1 bytes for tranche names registered via RequestNamedLWLockTranche(), no such limits previously existed. We could avoid these new limits by using dynamic shared memory, but the complexity involved didn't seem worth it. We briefly considered making the tranche limit user-configurable but ultimately decided against that, too. Since there is still a lot of time left in the v19 development cycle, it's possible we will revisit this choice. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAA5RZ0vvED3naph8My8Szv6DL4AxOVK3eTPS0qXsaKi%3DbVdW2A%40mail.gmail.com
2025-08-30Prepare DSM registry for upcoming changes to LWLock tranche names.Nathan Bossart
A proposed patch would place a limit of NAMEDATALEN-1 (i.e., 63) bytes on the names of dynamically-allocated LWLock tranches, but GetNamedDSA() and GetNamedDSHash() may register tranches with longer names. This commit lowers the maximum DSM registry entry name length to NAMEDATALEN-1 bytes and modifies GetNamedDSHash() to create only one tranche, thereby allowing us to keep the DSM registry's tranche names below NAMEDATALEN bytes. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aKzIg1JryN1qhNuy%40nathan
2025-08-28Avoid including commands/dbcommands.h in so many placesÁlvaro Herrera
This has been done historically because of get_database_name (which since commit cb98e6fb8fd4 belongs in lsyscache.c/h, so let's move it there) and get_database_oid (which is in the right place, but whose declaration should appear in pg_database.h rather than dbcommands.h). Clean this up. Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h since commit f1fd515b393a, so remove them. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql
2025-08-22Change dynahash.c and hsearch.h to use int64 instead of longMichael Paquier
This code was relying on "long", which is signed 8 bytes everywhere except on Windows where it is 4 bytes, that could potentially expose it to overflows, even if the current uses in the code are fine as far as I know. This code is now able to rely on the same sizeof() variable everywhere, with int64. long was used for sizes, partition counts and entry counts. Some callers of the dynahash.c routines used long declarations, that can be cleaned up to use int64 instead. There was one shortcut based on SIZEOF_LONG, that can be removed. long is entirely removed from dynahash.c and hsearch.h. Similar work was done in b1e5c9fa9ac4. Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz
2025-08-13Make type Datum be 8 bytes wide everywhere.Tom Lane
This patch makes sizeof(Datum) be 8 on all platforms including 32-bit ones. The objective is to allow USE_FLOAT8_BYVAL to be true everywhere, and in consequence to remove a lot of code that is specific to pass-by-reference handling of float8, int8, etc. The code for abbreviated sort keys can be simplified similarly. In this way we can reduce the maintenance effort involved in supporting 32-bit platforms, without going so far as to actually desupport them. Since Datum is strictly an in-memory concept, this has no impact on on-disk storage, though an initdb or pg_upgrade will be needed to fix affected catalog entries. We have required platforms to support [u]int64 for ages, so this breaks no supported platform. We can expect that this change will make 32-bit builds a bit slower and more memory-hungry, although being able to use pass-by-value handling of 8-byte types may buy back some of that. But we stopped optimizing for 32-bit cases a long time ago, and this seems like just another step on that path. This initial patch simply forces the correct type definition and USE_FLOAT8_BYVAL setting, and cleans up a couple of minor compiler complaints that ensued. This is sufficient for testing purposes. In the wake of a bunch of Datum-conversion cleanups by Peter Eisentraut, this now compiles cleanly with gcc on a 32-bit platform. (I'd only tested the previous version with clang, which it turns out is less picky than gcc about width-changing coercions.) There is a good deal of now-dead code that I'll remove in separate follow-up patches. A catversion bump is required because this affects initial catalog contents (on 32-bit machines) in two ways: pg_type.typbyval changes for some built-in types, and Const nodes in stored views/rules will now have 8 bytes not 4 for pass-by-value types. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-08Add missing Datum conversionsPeter Eisentraut
Add various missing conversions from and to Datum. The previous code mostly relied on implicit conversions or its own explicit casts instead of using the correct DatumGet*() or *GetDatum() functions. We think these omissions are harmless. Some actual bugs that were discovered during this process have been committed separately (80c758a2e1d, fd2ab03fea2). Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-07-29Handle cancel requests with PID 0 gracefullyHeikki Linnakangas
If the client sent a query cancel request with backend PID 0, it tripped an assertion. With assertions disabled, you got this in the log instead: LOG: invalid cancel request with PID 0 LOG: wrong key in cancel request for process 0 Query cancellations don't even require authentication, so we better tolerate bogus requests. Fix by turning the assertion into a regular runtime check. Spotted while testing libpq behavior with a modified server that didn't send BackendKeyData to the client. Backpatch-through: 18
2025-07-25Fix assertion failure with latch wait in single-user modeMichael Paquier
LatchWaitSetPostmasterDeathPos, the latch event position for the postmaster death event, is initialized under IsUnderPostmaster. WaitLatch() considered it as a valid wait target in single-user mode (!IsUnderPostmaster), which was incorrect. One code path found to fail with an assertion failure is a database drop in single-user mode while waiting in WaitForProcSignalBarrier() after the drop. Oversight in commit 84e5b2f07a5e. Author: Patrick Stählin <me@packi.ch> Co-authored-by: Ronan Dunklau <ronan.dunklau@aiven.io> Discussion: https://postgr.es/m/18996-3a2744c8140488de@postgresql.org Backpatch-through: 18
2025-07-23Preserve conflict-relevant data during logical replication.Amit Kapila
Logical replication requires reliable conflict detection to maintain data consistency across nodes. To achieve this, we must prevent premature removal of tuples deleted by other origins and their associated commit_ts data by VACUUM, which could otherwise lead to incorrect conflict reporting and resolution. This patch introduces a mechanism to retain deleted tuples on the subscriber during the application of concurrent transactions from remote nodes. Retaining these tuples allows us to correctly ignore concurrent updates to the same tuple. Without this, an UPDATE might be misinterpreted as an INSERT during resolutions due to the absence of the original tuple. Additionally, we ensure that origin metadata is not prematurely removed by vacuum freeze, which is essential for detecting update_origin_differs and delete_origin_differs conflicts. To support this, a new replication slot named pg_conflict_detection is created and maintained by the launcher on the subscriber. Each apply worker tracks its own non-removable transaction ID, which the launcher aggregates to determine the appropriate xmin for the slot, thereby retaining necessary tuples. Conflict information retention (deleted tuples and commit_ts) can be enabled per subscription via the retain_conflict_info option. This is disabled by default to avoid unnecessary overhead for configurations that do not require conflict resolution or logging. During upgrades, if any subscription on the old cluster has retain_conflict_info enabled, a conflict detection slot will be created to protect relevant tuples from deletion when the new cluster starts. This is a foundational work to correctly detect update_deleted conflict which will be done in a follow-up patch. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-07-12Remove long-unused TransactionIdIsActive()Andres Freund
TransactionIdIsActive() has not been used since bb38fb0d43c, in 2014. There are no known uses in extensions either and it's hard to see valid uses for it. Therefore remove TransactionIdIsActive(). Discussion: https://postgr.es/m/odgftbtwp5oq7cxjgf4kjkmyq7ypoftmqy7eqa7w3awnouzot6@hrwnl5tdqrgu
2025-07-09Introduce pg_dsm_registry_allocations view.Nathan Bossart
This commit adds a new system view that provides information about entries in the dynamic shared memory (DSM) registry. Specifically, it returns the name, type, and size of each entry. Note that since we cannot discover the size of dynamic shared memory areas (DSAs) and hash tables backed by DSAs (dshashes) without first attaching to them, the size column is left as NULL for those. Bumps catversion. Author: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Sungwoo Chang <swchangdev@gmail.com> Discussion: https://postgr.es/m/4D445D3E-81C5-4135-95BB-D414204A0AB4%40gmail.com
2025-07-07Standardize LSN formatting by zero paddingÁlvaro Herrera
This commit standardizes the output format for LSNs to ensure consistent representation across various tools and messages. Previously, LSNs were inconsistently printed as `%X/%X` in some contexts, while others used zero-padding. This often led to confusion when comparing. To address this, the LSN format is now uniformly set to `%X/%08X`, ensuring the lower 32-bit part is always zero-padded to eight hexadecimal digits. Author: Japin Li <japinli@hotmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
2025-07-02Add GetNamedDSA() and GetNamedDSHash().Nathan Bossart
Presently, the dynamic shared memory (DSM) registry only provides GetNamedDSMSegment(), which allocates a fixed-size segment. To use the DSM registry for more sophisticated things like dynamic shared memory areas (DSAs) or a hash table backed by a DSA (dshash), users need to create a DSM segment that stores various handles and LWLock tranche IDs and to write fairly complicated initialization code. Furthermore, there is likely little variation in this initialization code between libraries. This commit introduces functions that simplify allocating a DSA or dshash within the DSM registry. These functions are very similar to GetNamedDSMSegment(). Notable differences include the lack of an initialization callback parameter and the prohibition of calling the functions more than once for a given entry in each backend (which should be trivially avoidable in most circumstances). While at it, this commit bumps the maximum DSM registry entry name length from 63 bytes to 127 bytes. Also note that even though one could presumably detach/destroy the DSAs and dshashes created in the registry, such use-cases are not yet well-supported, if for no other reason than the associated DSM registry entries cannot be removed. Adding such support is left as a future exercise. The test_dsm_registry test module contains tests for the new functions and also serves as a complete usage example. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan
2025-07-01Silence valgrind about pg_numa_touch_mem_if_requiredTomas Vondra
When querying NUMA status of pages in shared memory, we need to touch the memory first to get valid results. This may trigger valgrind reports, because some of the memory (e.g. unpinned buffers) may be marked as noaccess. Solved by adding a valgrind suppresion. An alternative would be to adjust the access/noaccess status before touching the memory, but that seems far too invasive. It would require all those places to have detailed knowledge of what the shared memory stores. The pg_numa_touch_mem_if_required() macro is replaced with a function. Macros are invisible to suppressions, so it'd have to suppress reports for the caller - e.g. pg_get_shmem_allocations_numa(). So we'd suppress reports for the whole function, and that seems to heavy-handed. It might easily hide other valid issues. Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aEtDozLmtZddARdB@msg.df7cb.de Backpatch-through: 18
2025-05-23Revert function to get memory context stats for processesDaniel Gustafsson
Due to concerns raised about the approach, and memory leaks found in sensitive contexts the functionality is reverted. This reverts commits 45e7e8ca9, f8c115a6c, d2a1ed172, 55ef7abf8 and 042a66291 for v18 with an intent to revisit this patch for v19. Discussion: https://postgr.es/m/594293.1747708165@sss.pgh.pa.us
2025-05-08Use 'void *' for arbitrary buffers, 'uint8 *' for byte arraysHeikki Linnakangas
A 'void *' argument suggests that the caller might pass an arbitrary struct, which is appropriate for functions like libc's read/write, or pq_sendbytes(). 'uint8 *' is more appropriate for byte arrays that have no structure, like the cancellation keys or SCRAM tokens. Some places used 'char *', but 'uint8 *' is better because 'char *' is commonly used for null-terminated strings. Change code around SCRAM, MD5 authentication, and cancellation key handling to follow these conventions. Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org
2025-05-07Fix some comments related to IO workersMichael Paquier
IO workers are treated as auxiliary processes. The comments fixed in this commit stated that there could be only one auxiliary process of each BackendType at the same time. This is not true for IO workers, as up to MAX_IO_WORKERS of them can co-exist at the same time. Author: Cédric Villemain <Cedric.Villemain@data-bene.io> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/e4a3ac45-abce-4b58-a043-b4a31cd11113@Data-Bene.io
2025-04-09Cleanup of pg_numa.cTomas Vondra
This moves/renames some of the functions defined in pg_numa.c: * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and moved to src/backend/storage/ipc/shmem.c. The new name better reflects that the page size is not related to NUMA, and it's specifically about the page size used for the main shared memory segment. * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into the backend (which more appropriate for functions callable from SQL). While at it, improve the comment to explain what page size it returns. * remove unnecessary includes from src/port/pg_numa.c, adding unnecessary dependencies (src/port should be suitable for frontent). These were either leftovers or unnecessary thanks to the other changes in this commit. This eliminates unnecessary dependencies on backend symbols, which we don't want in src/port. Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
2025-04-08Add function to get memory context stats for processesDaniel Gustafsson
This adds a function for retrieving memory context statistics and information from backends as well as auxiliary processes. The intended usecase is cluster debugging when under memory pressure or unanticipated memory usage characteristics. When calling the function it sends a signal to the specified process to submit statistics regarding its memory contexts into dynamic shared memory. Each memory context is returned in detail, followed by a cumulative total in case the number of contexts exceed the max allocated amount of shared memory. Each process is limited to use at most 1Mb memory for this. A summary can also be explicitly requested by the user, this will return the TopMemoryContext and a cumulative total of all lower contexts. In order to not block on busy processes the caller specifies the number of seconds during which to retry before timing out. In the case where no statistics are published within the set timeout, the last known statistics are returned, or NULL if no previously published statistics exist. This allows dash- board type queries to continually publish even if the target process is temporarily congested. Context records contain a timestamp to indicate when they were submitted. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
2025-04-08Stabilize 035_standby_logical_decoding.pl.Amit Kapila
Some tests try to invalidate logical slots on the standby server by running VACUUM on the primary. The problem is that xl_running_xacts was getting generated and replayed before the VACUUM command, leading to the advancement of the active slot's catalog_xmin. Due to this, active slots were not getting invalidated, leading to test failures. We fix it by skipping the generation of xl_running_xacts for the required tests with the help of injection points. As the required interface for injection points was not present in back branches, we fixed the failing tests in them by disallowing the slot to become active for the required cases (where rows_removed conflict could be generated). Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
2025-04-07Introduce pg_shmem_allocations_numa viewTomas Vondra
Introduce new pg_shmem_alloctions_numa view with information about how shared memory is distributed across NUMA nodes. For each shared memory segment, the view returns one row for each NUMA node backing it, with the total amount of memory allocated from that node. The view may be relatively expensive, especially when executed for the first time in a backend, as it has to touch all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Unlike pg_shmem_allocations, the view does not show anonymous shared memory allocations. It also does not show memory allocated using the dynamic shared memory infrastructure. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
2025-04-04Revert "Improve accounting for memory used by shared hash tables"Tomas Vondra
This reverts commit f5930f9a98ea65d659d41600a138e608988ad122. This broke the expansion of private hash tables, which reallocates the directory. But that's impossible when it's allocated together with the other fields, and dir_realloc() failed with BogusFree. Clearly, this needs rethinking. Discussion: https://postgr.es/m/CAApHDvriCiNkm=v521AP6PKPfyWkJ++jqZ9eqX4cXnhxLv8w-A@mail.gmail.com
2025-04-02Improve accounting for memory used by shared hash tablesTomas Vondra
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(), but for shared hash tables that covered only the header and hash directory. The remaining parts (segments and buckets) were allocated later using ShmemAlloc(), which does not update the shmem accounting. Thus, these allocations were not shown in pg_shmem_allocations. This commit improves the situation by allocating all the hash table parts at once, using a single ShmemInitStruct() call. This way the ShmemIndex entries (and thus pg_shmem_allocations) better reflect the proper size of the hash table. This affects allocations for private (non-shared) hash tables too, as the hash_create() code is shared. For non-shared tables this however makes no practical difference. This changes the alignment a bit. ShmemAlloc() aligns the chunks using CACHELINEALIGN(), which means some parts (header, directory, segments) were aligned this way. Allocating all parts as a single chunk removes this (implicit) alignment. We've considered adding explicit alignment, but we've decided not to - it seems to be merely a coincidence due to using the ShmemAlloc() API, not due to necessity. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
2025-04-02Make cancel request keys longerHeikki Linnakangas
Currently, the cancel request key is a 32-bit token, which isn't very much entropy. If you want to cancel another session's query, you can brute-force it. In most environments, an unauthorized cancellation of a query isn't very serious, but it nevertheless would be nice to have more protection from it. Hence make the key longer, to make it harder to guess. The longer cancellation keys are generated when using the new protocol version 3.2. For connections using version 3.0, short 4-bytes keys are still used. The new longer key length is not hardcoded in the protocol anymore, the client is expected to deal with variable length keys, up to 256 bytes. This flexibility allows e.g. a connection pooler to add more information to the cancel key, which might be useful for finding the connection. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi
2025-03-29Use PRI?64 instead of "ll?" in format strings (continued).Peter Eisentraut
Continuation of work started in commit 15a79c73, after initial trial. Author: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/b936d2fb-590d-49c3-a615-92c3a88c6c19%40eisentraut.org
2025-03-23Fix rare assertion failure in standby, if primary is restartedHeikki Linnakangas
During hot standby, ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() functions mark old transactions as no-longer running, but they failed to update xactCompletionCount and latestCompletedXid. AFAICS it would not lead to incorrect query results, because those functions effectively turn in-progress transactions into aborted transactions and an MVCC snapshot considers both as "not visible". But it could surprise GetSnapshotDataReuse() and trigger the "TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin))" assertion in it, if the apparent xmin in a backend would move backwards. We saw this happen when GetCatalogSnapshot() would reuse an older catalog snapshot, when GetTransactionSnapshot() had already advanced TransactionXmin. The bug goes back all the way to commit 623a9ba79b in v14 that introduced the snapshot reuse mechanism, but it started to happen more frequently with commit 952365cded6 which removed a GetTransactionSnapshot() call from backend startup. That made it more likely for ExpireOldKnownAssignedTransactionIds() to be called between GetCatalogSnapshot() and the first GetTransactionSnapshot() in a backend. Andres Freund first spotted this assertion failure on buildfarm member 'skink'. Reproduction and analysis by Tomas Vondra. Backpatch-through: 14 Discussion: https://www.postgresql.org/message-id/oey246mcw43cy4qw2hqjmurbd62lfdpcuxyqiu7botx3typpax%40h7o7mfg5zmdj
2025-03-17aio: Basic subsystem initializationAndres Freund
This commit just does the minimal wiring up of the AIO subsystem, added in the next commit, to the rest of the system. The next commit contains more details about motivation and architecture. This commit is kept separate to make it easier to review, separating the changes across the tree, from the implementation of the new subsystem. We discussed squashing this commit with the main commit before merging AIO, but there has been a mild preference for keeping it separate. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
2025-03-05Split WaitEventSet functions to separate source fileHeikki Linnakangas
latch.c now only contains the Latch related functions, which build on the WaitEventSet abstraction. Most of the platform-dependent stuff is now in waiteventset.c. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi
2025-03-05Use ModifyWaitEvent to update exit_on_postmaster_deathHeikki Linnakangas
This is in preparation for splitting WaitEventSet related functions to a separate source file. That will hide the details of WaitEventSet from WaitLatch, so it must use an exposed function instead of modifying WaitEventSet->exit_on_postmaster_death directly. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi
2025-03-05Remove unused ShutdownLatchSupport() functionHeikki Linnakangas
The only caller was removed in commit 80a8f95b3b. I don't foresee needing it any time soon, and I'm working on some big changes in this area, so let's remove it out of the way. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi
2025-02-27Fix possible double-release of spinlock in procsignal.cMichael Paquier
9d9b9d46f3c5 has added spinlocks to protect the fields in ProcSignal flags, introducing a code path in ProcSignalInit() where a spinlock could be released twice if the pss_pid field of a ProcSignalSlot is found as already set. Multiple spinlock releases have no effect with most spinlock implementations, but this could cause the code to run into issues when the spinlock is acquired concurrently by a different process. This sanity check on pss_pid generates a LOG that can be delayed until after the spinlock is released as, like older versions up to v17, the code expects the initialization of the ProcSignalSlot to happen even if pss_pid is found incorrect. The code is changed so as the old pss_pid is read while holding the slot's spinlock, with the LOG from the sanity check generated after releasing the spinlock, preventing the double release. Author: Maksim Melnikov <m.melnikov@postgrespro.ru> Co-authored-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/dca47527-2d8b-4e3b-b5a0-e2deb73371a4@postgrespro.ru
2025-02-13Remove unnecessary (char *) casts [xlog]Peter Eisentraut
Remove (char *) casts no longer needed after XLogRegisterData() and XLogRegisterBufData() argument type change. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org