path: root/src/backend/replication/logical/reorderbuffer.c
8 days ago  Use palloc_object() and palloc_array() in backend code  (Michael Paquier)

The idea is to further encourage the use of these new routines across the tree, as these offer stronger type safety guarantees than palloc(). This batch of changes includes most of the trivial changes suggested by the author for src/backend/.

A total of 334 files are updated here. Among these files, 48 of them have their build change slightly; these changes are caused by line number changes, as the new allocation formulas are simpler, shaving around 100 lines of code in total. Similar work has been done in 0c3c5c3b06a3 and 31d3847a37be.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
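As a minimal sketch of the kind of change involved (the call sites here are illustrative, not taken from the commit):

    /* Before: the cast and the sizeof must be kept in sync by hand. */
    ReorderBufferTXN *txn =
        (ReorderBufferTXN *) palloc(sizeof(ReorderBufferTXN));
    ReorderBufferChange **changes =
        (ReorderBufferChange **) palloc(sizeof(ReorderBufferChange *) * nchanges);

    /* After: the macros derive both the size and the cast from the type. */
    ReorderBufferTXN *txn2 = palloc_object(ReorderBufferTXN);
    ReorderBufferChange **changes2 = palloc_array(ReorderBufferChange *, nchanges);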
2025-11-06  Use XLogRecPtrIsValid() in various places  (Álvaro Herrera)

Now that commit 06edbed47862 has introduced XLogRecPtrIsValid(), we can use that instead of:

- XLogRecPtrIsInvalid()
- direct comparisons with InvalidXLogRecPtr
- direct comparisons with literal 0

This makes the code more consistent.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal
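A sketch of the unification (the helper function is hypothetical; ReorderBufferTXN.final_lsn is a real field):

    static bool
    have_final_lsn(ReorderBufferTXN *txn)
    {
        /*
         * Previously spelled as any of: !XLogRecPtrIsInvalid(txn->final_lsn),
         * txn->final_lsn != InvalidXLogRecPtr, or txn->final_lsn != 0.
         */
        return XLogRecPtrIsValid(txn->final_lsn);
    }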
2025-10-21  Use CompactAttribute more often, when possible  (David Rowley)

5983a4cff added CompactAttribute for storing commonly used fields from FormData_pg_attribute. 5983a4cff didn't go to the trouble of adjusting every location where we can use CompactAttribute rather than FormData_pg_attribute, so here we change the remaining ones.

There are some locations where I've left the code using FormData_pg_attribute. These are mostly in the ALTER TABLE code. Using CompactAttribute here seems more risky as often the TupleDesc is being changed and those changes may not have been flushed to the CompactAttribute yet.

I've also left record_recv(), record_send(), record_cmp(), record_eq() and record_image_eq() alone as it's not clear to me that accessing the CompactAttribute is a win here due to the FormData_pg_attribute still having to be accessed for most cases. Switching the relevant parts to use CompactAttribute would result in having to access both for common cases. Careful benchmarking may reveal that something can be done to make this better, but in absence of that, the safer option is to leave these alone.

In ReorderBufferToastReplace(), there was a check to skip attnums < 0 while looping over the TupleDesc. Doing this is redundant since TupleDescs don't store < 0 attnums. Removing that code allows us to move to using CompactAttribute.

The change in validateDomainCheckConstraint() just moves fetching the FormData_pg_attribute into the ERROR path, which is cold due to calling errstart_cold(), and results in code being moved out of the common path.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com
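A sketch of the access pattern (TupleDescCompactAttr() is the real accessor; the helper function and field use here are illustrative):

    static int
    count_live_attrs(TupleDesc tupdesc)
    {
        int         live = 0;

        for (int i = 0; i < tupdesc->natts; i++)
        {
            /* Cheaper than TupleDescAttr() for commonly used fields. */
            CompactAttribute *attr = TupleDescCompactAttr(tupdesc, i);

            if (!attr->attisdropped)
                live++;
        }
        return live;
    }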
2025-10-08  Add mem_exceeded_count column to pg_stat_replication_slots.  (Masahiko Sawada)

This commit introduces a new column mem_exceeded_count to the pg_stat_replication_slots view. This counter tracks how often the memory used by logical decoding exceeds the logical_decoding_work_mem limit.

The new statistic helps users determine whether exceeding the logical_decoding_work_mem limit is a rare occurrence or a frequent issue, information that wasn't available through existing statistics.

Bumps catversion.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/978D21E8-9D3B-40EA-A4B1-F87BABE7868C@yesql.se
2025-09-30  Make some use of anonymous unions [reorderbuffer xact_time]  (Peter Eisentraut)

Make some use of anonymous unions, which are allowed as of C11, as examples and encouragement for future code, and to test compilers. This commit changes the ReorderBufferTXN struct.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f00a9968-388e-4f8c-b5ef-5102e962d997%40eisentraut.org
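A sketch of the C11 feature itself (DemoTxn is a made-up struct, not the committed ReorderBufferTXN layout):

    typedef struct DemoTxn
    {
        TransactionId xid;
        union
        {
            TimestampTz commit_time;
            TimestampTz prepare_time;
        };                      /* anonymous: the union has no member name */
    } DemoTxn;

    /*
     * Members are then addressed directly: txn->commit_time rather than
     * txn->xact_time.commit_time.
     */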
2025-09-12  Avoid unexpected changes of CurrentResourceOwner and CurrentMemoryContext  (Álvaro Herrera)

Users of logical decoding can encounter an unexpected change of CurrentResourceOwner and CurrentMemoryContext. The problem is that, unlike other call sites of RollbackAndReleaseCurrentSubTransaction(), in reorderbuffer.c we fail to restore the original values of these global variables after they are clobbered by subtransaction abort. This patch saves the values prior to the call and restores them afterwards.

In addition, logical.c and logicalfuncs.c had a hack to restore the resource owner, presumably because of the lack of this restore. Remove that. Instead, because the test coverage here is not very consistent, add an Assert() to ensure that the resowner is kept identical; this makes it easy to detect other cases where we fail to restore the resowner properly. This could be removed later.

This is arguably an old bug, but there appears to be no reason to backpatch it and it's risky to do so, so refrain for now.

Author: Antonin Houska <ah@cybertec.at>
Reported-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/119497.1756892972@localhost
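The save/restore pattern described above, condensed into a sketch:

    ResourceOwner   save_owner = CurrentResourceOwner;
    MemoryContext   save_cxt = CurrentMemoryContext;

    RollbackAndReleaseCurrentSubTransaction();

    /* Subtransaction abort clobbered both globals; put them back. */
    CurrentResourceOwner = save_owner;
    MemoryContextSwitchTo(save_cxt);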
2025-08-02  Fix use-after-free with INSERT ON CONFLICT changes in reorderbuffer.c  (Michael Paquier)

In ReorderBufferProcessTXN(), used to send the data of a transaction to an output plugin, INSERT ON CONFLICT changes (INTERNAL_SPEC_INSERT) are delayed until a confirmation record arrives (INTERNAL_SPEC_CONFIRM), updating the change being processed.

8c58624df462 has added an extra step after processing a change to update the progress of the transaction, by calling the callback update_progress_txn() based on the LSN stored in a change after a threshold of CHANGES_THRESHOLD (100) is reached. This logic has missed the fact that for an INSERT ON CONFLICT change the data is freed once processed, hence update_progress_txn() could be called pointing to an LSN value that's already been freed. This could result in random crashes, depending on the workload.

Per discussion, this issue is fixed by reusing in update_progress_txn() the LSN from the change processed found at the beginning of the loop, meaning that for an INTERNAL_SPEC_CONFIRM change the progress is updated using the LSN of the INTERNAL_SPEC_CONFIRM change, and not the LSN from its INTERNAL_SPEC_INSERT change. This is actually more correct, as we want to update the progress to point to the INTERNAL_SPEC_CONFIRM change.

Masahiko Sawada has found a nice trick to reproduce the issue: hardcode CHANGES_THRESHOLD at 1 and run test_decoding (test "ddl" being enough) on an instance running valgrind. The bug has been analyzed by Ethan Mertz, who also originally suggested the solution used in this patch.

Issue introduced by 8c58624df462, so backpatch down to v16.

Author: Ethan Mertz <ethan.mertz@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aIsQqDZ7x4LAQ6u1@paquier.xyz
Backpatch-through: 16
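A sketch of the fix's shape (local variable names are hypothetical; rb->update_progress_txn is the callback named above):

    /* Capture the LSN while the change is known to be alive ... */
    XLogRecPtr  change_lsn = change->lsn;

    /* ... process the change; INTERNAL_SPEC_INSERT data may be freed ... */

    if (++changes_count >= CHANGES_THRESHOLD)
    {
        /* ... and report progress from the saved copy, not change->lsn. */
        rb->update_progress_txn(rb, txn, change_lsn);
        changes_count = 0;
    }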
2025-07-23  Preserve conflict-relevant data during logical replication.  (Amit Kapila)

Logical replication requires reliable conflict detection to maintain data consistency across nodes. To achieve this, we must prevent premature removal of tuples deleted by other origins and their associated commit_ts data by VACUUM, which could otherwise lead to incorrect conflict reporting and resolution.

This patch introduces a mechanism to retain deleted tuples on the subscriber during the application of concurrent transactions from remote nodes. Retaining these tuples allows us to correctly ignore concurrent updates to the same tuple. Without this, an UPDATE might be misinterpreted as an INSERT during resolutions due to the absence of the original tuple.

Additionally, we ensure that origin metadata is not prematurely removed by vacuum freeze, which is essential for detecting update_origin_differs and delete_origin_differs conflicts.

To support this, a new replication slot named pg_conflict_detection is created and maintained by the launcher on the subscriber. Each apply worker tracks its own non-removable transaction ID, which the launcher aggregates to determine the appropriate xmin for the slot, thereby retaining necessary tuples.

Conflict information retention (deleted tuples and commit_ts) can be enabled per subscription via the retain_conflict_info option. This is disabled by default to avoid unnecessary overhead for configurations that do not require conflict resolution or logging.

During upgrades, if any subscription on the old cluster has retain_conflict_info enabled, a conflict detection slot will be created to protect relevant tuples from deletion when the new cluster starts.

This is a foundational work to correctly detect the update_deleted conflict, which will be done in a follow-up patch.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-07-01  Make more use of binaryheap_empty() and binaryheap_size().  (Nathan Bossart)

A few places were accessing bh_size directly instead of via these handy macros.

Author: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAJ7c6TPQMVL%2B028T4zuw9ZqL5Du9JavOLhBQLkJeK0RznYx_6w%40mail.gmail.com
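A small before/after sketch (the surrounding code is illustrative):

    /* Before: reaching into the struct. */
    if (heap->bh_size == 0)
        return NULL;

    /* After: the accessor macros say the same thing more readably. */
    if (binaryheap_empty(heap))
        return NULL;
    Assert(binaryheap_size(heap) > 0);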
2025-06-17  Fix re-distributing previously distributed invalidation messages during logical decoding.  (Masahiko Sawada)

Commit 4909b38af0 introduced logic to distribute invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. However, since each transaction distributes not only its original invalidation messages but also previously distributed messages to other transactions, this leads to an exponential increase in allocation request size for invalidation messages, ultimately causing memory allocation failure.

This commit fixes this issue by tracking distributed invalidation messages separately per decoded transaction and not redistributing these messages to other in-progress transactions. The maximum size of distributed invalidation messages that one transaction can store is limited to MAX_DISTR_INVAL_MSG_PER_TXN (8MB). Once the size of the distributed invalidation messages exceeds this threshold, we invalidate all caches in locations where distributed invalidation messages need to be executed.

Back-patch to all supported versions where we introduced the fix by commit 4909b38af0.

Note that this commit adds two new fields to ReorderBufferTXN to store the distributed invalidation messages. This change breaks ABI compatibility in back branches, affecting third-party extensions that depend on the size of the ReorderBufferTXN struct, though this scenario seems unlikely. Additionally, it adds a new flag to the txn_flags field of ReorderBufferTXN to indicate distributed invalidation message overflow. This should not affect existing implementations, as it is unlikely that third-party extensions use unused bits in the txn_flags field.

Bug: #18938 #18942
Author: vignesh C <vignesh21@gmail.com>
Reported-by: Duncan Sands <duncan.sands@deepbluecap.com>
Reported-by: John Hutchins <john.hutchins@wicourts.gov>
Reported-by: Laurence Parry <greenreaper@hotmail.com>
Reported-by: Max Madden <maxmmadden@gmail.com>
Reported-by: Braulio Fdo Gonzalez <brauliofg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/680bdaf6-f7d1-4536-b580-05c2760c67c6@deepbluecap.com
Discussion: https://postgr.es/m/18942-0ab1e5ae156613ad@postgresql.org
Discussion: https://postgr.es/m/18938-57c9a1c463b68ce0@postgresql.org
Discussion: https://postgr.es/m/CAD1FGCT2sYrP_70RTuo56QTizyc+J3wJdtn2gtO3VttQFpdMZg@mail.gmail.com
Discussion: https://postgr.es/m/CANO2=B=2BT1hSYCE=nuuTnVTnjidMg0+-FfnRnqM6kd23qoygg@mail.gmail.com
Backpatch-through: 13
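A rough sketch of the cap described above (the constant's definition is inferred from the 8MB figure; the flag and counter names are hypothetical):

    /* 8MB worth of SharedInvalidationMessage entries (inferred definition). */
    #define MAX_DISTR_INVAL_MSG_PER_TXN \
        ((8 * 1024 * 1024) / sizeof(SharedInvalidationMessage))

    if (txn->ninvalidations_distr + nmsgs >= MAX_DISTR_INVAL_MSG_PER_TXN)
    {
        /* Stop tracking individual messages; invalidate all caches instead. */
        txn->txn_flags |= RBTXN_DISTR_INVAL_OVERFLOWED;   /* name illustrative */
    }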
2025-04-10  Fix data loss in logical replication.  (Amit Kapila)

Data loss can happen when DDLs like ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ..., which don't take a strong lock on the table, happen concurrently with DMLs on the tables involved in the DDL. This happens because logical decoding doesn't distribute invalidations to concurrent transactions, and those transactions use stale cache data to decode the changes. The problem becomes bigger because we keep using the stale cache even after those in-progress transactions are finished and skip the changes required to be sent to the client.

This commit fixes the issue by distributing invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. This allows the necessary rebuild of the catalog cache when decoding new changes after concurrent DDL.

We observed performance regression primarily during frequent execution of *publication DDL* statements that modify the published tables. The regression is minor or nearly nonexistent for DDLs that do not affect the published tables or occur infrequently, making this a worthwhile cost to resolve a longstanding data loss issue.

An alternative approach considered was to take a strong lock on each affected table during publication modification. However, this would only address issues related to publication DDLs (but not ALTER TYPE ...) and would require locking every relation in the database for publications created as FOR ALL TABLES, which is impractical.

The bug exists in all supported branches, but we are backpatching only down to 14. The fix for 13 requires somewhat bigger changes than this fix, so the fix for that branch is still under discussion.

Reported-by: hubert depesz lubaczewski <depesz@depesz.com>
Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Tested-by: Benoit Lobréau <benoit.lobreau@dalibo.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
Discussion: https://postgr.es/m/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
2025-03-12  Rename alloc/free functions in reorderbuffer.c  (Heikki Linnakangas)

There used to be bespoke pools for these structs to reduce the palloc/pfree overhead, but that was ripped out a long time ago and replaced with the generic, cheaper generational memory allocator (commit a4ccc1cef5). The Get/Return terminology made sense with the pools, as you "got" an object from the pool and "returned" it later, but now it just looks weird. Rename to Alloc/Free.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/c9e43d2d-8e83-444f-b111-430377368989@iki.fi
2025-02-25  Change relpath() et al to return path by value  (Andres Freund)

For AIO, and also some other recent patches, we need the ability to call relpath() in a critical section. Until now that was not feasible, as it allocated memory. The fact that relpath() allocated memory also made it awkward to use in log messages, because we had to take care to free the memory afterwards, which we e.g. didn't do when zeroing out an invalid buffer.

We discussed other solutions, e.g. filling a pre-allocated buffer that's passed to relpath(), but they all came with plenty of downsides or were larger projects. The easiest fix seems to be to make relpath() return the path by value.

To be able to return the path by value we need to determine the maximum length of a relation path. This patch adds a long #define that computes the exact maximum, which is verified to be correct in a regression test.

As this changes the signature of relpath(), extensions using it will need to adapt their code. We discussed leaving a backward-compat shim in place, but decided it's not worth it given that the use of relpath() doesn't seem widespread.

Discussion: https://postgr.es/m/xeri5mla4b5syjd5a25nok5iez2kr3bm26j2qn4u7okzof2bmf@kwdh2vf7npra
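A sketch of what by-value usage looks like (the RelPathStr struct and its .str field reflect my reading of the commit; treat them as assumptions):

    /* Returned by value: a fixed-size struct, safe in a critical section. */
    RelPathStr  path = relpathperm(rlocator, MAIN_FORKNUM);

    elog(LOG, "zeroing invalid buffer in %s", path.str);    /* no pfree() */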
2025-02-13  Rename RBTXN_PREPARE to RBTXN_IS_PREPARE for better clarification.  (Masahiko Sawada)

The RBTXN_PREPARE flag and the rbtxn_prepared macro could be misinterpreted as indicating either the transaction type (e.g. a prepared transaction or a normal transaction) or its current state (e.g. skipped, or its prepare message sent), especially after commit 072ee847ad4 introduced the RBTXN_SENT_PREPARE flag and the rbtxn_sent_prepare macro.

The RBTXN_PREPARE flag (and its corresponding macro) have been renamed to RBTXN_IS_PREPARE to explicitly indicate the transaction type. Therefore, this commit also adds the RBTXN_IS_PREPARE flag to a transaction that is a prepared transaction and has been skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag.

Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com
2025-02-13  Skip logical decoding of already-aborted transactions.  (Masahiko Sawada)

Previously, transaction aborts were detected concurrently only during system catalog scans while replaying a transaction in streaming mode.

This commit adds an additional CLOG lookup to check the transaction status, allowing the logical decoding to skip changes even when it doesn't touch system catalogs, if the transaction is already aborted. This optimization enhances logical decoding performance, especially for large transactions that have already been rolled back, as it avoids unnecessary disk or network I/O.

To avoid potential slowdowns caused by frequent CLOG lookups for small transactions (most of which commit), the CLOG lookup is performed only for large transactions before eviction. The performance benchmark results showed no noticeable performance regression due to CLOG lookups.

Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian
Reviewed-by: Dilip Kumar, Andres Freund
Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com
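A sketch of the pre-eviction check (the helper name is hypothetical; TransactionIdDidAbort() is the real CLOG lookup):

    static bool
    can_discard_aborted_txn(ReorderBufferTXN *txn)
    {
        /* Only worth the CLOG lookup for large transactions, pre-eviction. */
        return TransactionIdIsValid(txn->xid) &&
               TransactionIdDidAbort(txn->xid);
    }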
2025-01-31  Get rid of our dependency on type "long" for memory size calculations.  (Tom Lane)

Consistently use "Size" (or size_t, or in some places int64 or double) as the type for variables holding memory allocation sizes. In most places variables' data types were fine already, but we had an ancient habit of computing bytes from kilobytes-units GUCs with code like "work_mem * 1024L". That risks overflow on Win64, where they did not make "long" as wide as "size_t". We worked around that by restricting such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB on Win64. This patch removes that restriction, after replacing such calculations with "work_mem * (Size) 1024" or variants of that.

It should be noted that this patch was constructed by searching outwards from the GUCs that have MAX_KILOBYTES as upper limit. So I can't positively guarantee there are no other places doing memory-size arithmetic in int or long variables. I do however feel pretty confident that increasing MAX_KILOBYTES on Win64 is safe now. Also, nothing in our code should be dealing in multiple-gigabyte allocations without authorization from a relevant GUC, so it seems pretty likely that this search caught everything that could be at risk of overflow.

Author: Vladlen Popolitov <v.popolitov@postgrespro.ru>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217
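The overflow in miniature (work_mem is the real int GUC; variable names are illustrative):

    /* Before: "long" is 32 bits on Win64, so this wraps past 2GB-1. */
    long    limit_bad = work_mem * 1024L;

    /* After: Size is size_t, which is 64 bits on every 64-bit platform. */
    Size    limit = work_mem * (Size) 1024;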
2025-01-29  Fix grammatical typos around possessive "its"  (John Naylor)

Some places spelled it "it's", which is short for "it is". In passing, fix a couple of other nearby grammatical errors.

Author: Jacob Brazeal <jacob.brazeal@gmail.com>
Discussion: https://postgr.es/m/CA+COZaAO8g1KJCV0T48=CkJMjAnnfTGLWOATz+2aCh40c2Nm+g@mail.gmail.com
2025-01-01  Update copyright for 2025  (Bruce Momjian)

Backpatch-through: 13
2024-12-20  Fix variable reference in comment  (Heikki Linnakangas)

This used to say "nsubxcnt isn't decreased when subtransactions abort", but there's no variable called nsubxcnt. Commit 8548ddc61b changed it to "subxcnt", among other typo fixes, but that was wrong too: the comment actually talks about txn->nsubtxns. That's the field that's incremented but never decremented and is used for the allocation earlier in the function.
2024-10-16  Reduce memory block size for decoded tuple storage to 8kB.  (Masahiko Sawada)

Commit a4ccc1cef introduced the Generation Context and modified the logical decoding process to use a Generation Context with a fixed block size of 8MB for storing tuple data decoded during logical decoding (i.e., rb->tup_context).

Several reports have indicated that the logical decoding process can be terminated due to out-of-memory (OOM) situations caused by excessive memory usage in rb->tup_context. This issue can occur when decoding a workload involving several concurrent transactions, including a long-running transaction that modifies tuples.

By design, the Generation Context does not free a memory block until all chunks within that block are released. Consequently, if tuples modified by the long-running transaction are stored across multiple memory blocks, these blocks remain allocated until the long-running transaction completes, leading to substantial memory fragmentation. The memory usage during logical decoding, tracked by rb->size, does not account for memory fragmentation, resulting in potentially much higher memory consumption than the value of the logical_decoding_work_mem parameter.

Various improvement strategies were discussed in the relevant thread. This change reduces the block size of the Generation Context used in rb->tup_context from 8MB to 8kB. This modification significantly decreases the likelihood of substantial memory fragmentation occurring and is relatively straightforward to backport. Performance testing across multiple platforms has confirmed that this change will not introduce any performance degradation that would impact actual operation.

Backport to all supported branches.

Reported-by: Alex Richman, Michael Guissine, Avi Weinberg
Reviewed-by: Amit Kapila, Fujii Masao, David Rowley
Tested-by: Hayato Kuroda, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoBTY1LATZUmvSXEssvq07qDZufV4AF-OHh9VD2pC0VY2A%40mail.gmail.com
Backpatch-through: 12
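A sketch of the changed allocation (GenerationContextCreate() is the real constructor; using SLAB_DEFAULT_BLOCK_SIZE, i.e. 8kB, for all three size arguments is my reading of the commit):

    rb->tup_context = GenerationContextCreate(rb->context,
                                              "Tuples",
                                              SLAB_DEFAULT_BLOCK_SIZE,  /* 8kB */
                                              SLAB_DEFAULT_BLOCK_SIZE,
                                              SLAB_DEFAULT_BLOCK_SIZE);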
2024-08-30  Define PG_LOGICAL_DIR for path pg_logical/ in data folder  (Michael Paquier)

This is similar to 2065ddf5e34c, but this time for pg_logical/ itself and its contents, like the paths for snapshots, mappings or origin checkpoints.

Author: Bertrand Drouvot
Reviewed-by: Ashutosh Bapat, Yugo Nagata, Michael Paquier
Discussion: https://postgr.es/m/ZryVvjqS9SnV1GPP@ip-10-97-1-34.eu-west-3.compute.internal
2024-08-30  Define PG_REPLSLOT_DIR for path pg_replslot/ in data folder  (Michael Paquier)

This commit replaces most of the hardcoded values of "pg_replslot" by a new PG_REPLSLOT_DIR #define. This makes the style more consistent with the existing PG_STAT_TMP_DIR, for example. More places will follow a similar change.

Author: Bertrand Drouvot
Reviewed-by: Ashutosh Bapat, Yugo Nagata, Michael Paquier
Discussion: https://postgr.es/m/ZryVvjqS9SnV1GPP@ip-10-97-1-34.eu-west-3.compute.internal
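The shape of the change, sketched (the snprintf() call site is illustrative):

    #define PG_REPLSLOT_DIR "pg_replslot"

    char        path[MAXPGPATH];

    /* Before: snprintf(path, sizeof(path), "pg_replslot/%s", slotname); */
    snprintf(path, sizeof(path), "%s/%s", PG_REPLSLOT_DIR, slotname);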
2024-08-26  Fix indentation.  (Masahiko Sawada)
2024-08-26  Fix memory counter update in ReorderBuffer.  (Masahiko Sawada)

Commit 5bec1d6bc5e changed the memory usage updates of the ReorderBufferTXN to be zeroed all at once by subtracting txn->size, rather than updating the counter for each change. However, if TOAST reconstruction data remained in the transaction when freeing it, there were cases where the memory counter was subtracted below zero, resulting in an assertion failure.

This change calculates the memory size for each change and updates the memory usage to precisely the amount that has been freed.

Backpatch to v17, where this was introduced.

Reviewed-by: Amit Kapila, Shlok Kyal
Discussion: https://postgr.es/m/CAD21AoAqkNUvicgKPT_dXzNoOwpPkVTg0QPPxEcWmzT0moCJ1g%40mail.gmail.com
Backpatch-through: 17
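A sketch of the per-change accounting path (the loop is condensed; ReorderBufferReturnChange() is the real release function, and its upd_mem flag is my assumption about the signature in this branch):

    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &txn->changes)
    {
        ReorderBufferChange *change =
            dlist_container(ReorderBufferChange, node, iter.cur);

        /* Frees the change and subtracts its computed size from rb->size. */
        ReorderBufferReturnChange(rb, change, true /* upd_mem */);
    }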
2024-04-11  Revert indexed and enlargeable binary heap implementation.  (Masahiko Sawada)

This reverts commits b840508644 and bcb14f4abc. These commits were made for commit 5bec1d6bc5 (Improve eviction algorithm in ReorderBuffer using max-heap for many subtransactions). However, per discussion, commit efb8acc0d0 replaced binary heap + index with pairing heap, and made these commits unnecessary.

Reported-by: Jeff Davis
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
2024-04-11  Replace binaryheap + index with pairingheap in reorderbuffer.c  (Masahiko Sawada)

A pairing heap can perform the same operations as the binary heap + index, with as good or better algorithmic complexity, and it is an existing data structure, so we don't need to invent anything new compared to v16. This commit makes the new binaryheap functionality that was added in commits b840508644 and bcb14f4abc unnecessary, but they will be reverted separately.

Remove the optimization to only build and maintain the heap when the amount of memory used is close to the limit, because the bookkeeping overhead with the pairing heap seems to be small enough that it doesn't matter in practice.

Reported-by: Jeff Davis
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
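A sketch of the pairing-heap API involved (the comparator and the txn_node member name are assumptions; the functions are the real lib/pairingheap.h interface):

    pairingheap *heap = pairingheap_allocate(txn_size_cmp, NULL);

    pairingheap_add(heap, &txn->txn_node);              /* O(1) insert */
    pairingheap_node *largest = pairingheap_first(heap); /* O(1) peek at max */
    pairingheap_remove(heap, &txn->txn_node);           /* amortized O(log n) */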
2024-04-03  Improve eviction algorithm in ReorderBuffer using max-heap for many subtransactions.  (Masahiko Sawada)

Previously, when selecting the transaction to evict during logical decoding, we checked all transactions to find the largest one. This could lead to significant replication lag, especially in cases where there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using a max-heap with transaction size as the key, to efficiently find the largest transaction.

The max-heap starts out empty. While the max-heap is empty, we don't do anything for the max-heap when updating the memory counter. Therefore, we get the largest transaction in O(N) time, where N is the number of transactions, including top-level transactions and subtransactions. We build the max-heap just before selecting the largest transaction if the number of transactions being decoded is higher than the threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also update the max-heap when updating the memory counter. The intention is to efficiently find the largest transaction in O(1) time instead of incurring the cost of memory counter updates (O(log N)). Once the number of transactions goes below the threshold, we reset the max-heap.

The performance benchmark results showed significant speedup (more than 30x on my machine) in decoding a transaction with 100k subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra, Shubham Khanna, Peter Smith, Álvaro Herrera, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
2024-04-03  Add functions to binaryheap for efficient key removal and update.  (Masahiko Sawada)

Previously, binaryheap didn't support updating a key or removing a node in an efficient way. For example, in order to remove a node from the binaryheap, the caller had to pass the node's position within the array that the binaryheap internally has. Removing a node from the binaryheap is done in O(log n), but searching for the key's position is done in O(n).

This commit adds a hash table to binaryheap in order to track the position of each node in the binaryheap. That way, by using newly added functions such as binaryheap_update_up() etc., both updating a key and removing a node can be done in O(1) on average and O(log n) in the worst case. This is known as the indexed binary heap. The caller can specify to use the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be used by an upcoming patch.

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian, Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
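A sketch of this (since reverted) API as the message describes it; the exact signatures are assumptions:

    /* indexed = true enables the internal hash table of node positions. */
    binaryheap *heap = binaryheap_allocate(capacity, txn_size_cmp,
                                           true /* indexed */, NULL);

    binaryheap_add(heap, PointerGetDatum(txn));
    binaryheap_update_up(heap, PointerGetDatum(txn));   /* key increased */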
2024-03-14  Fix typos in reorderbuffer.c.  (Amit Kapila)

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20240314.132817.1496502692848380820.horikyota.ntt@gmail.com
2024-03-04  Remove unused #include's from backend .c files  (Peter Eisentraut)

as determined by include-what-you-use (IWYU)

While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one.

Some manual adjustments of the automatic result:

- IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually.

- All includes for port(ability) headers are being kept for now, to play it safe.

- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size.

Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways.

As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there.

Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-02-16  Use new overflow-safe integer comparison functions.  (Nathan Bossart)

Commit 6b80394781 introduced integer comparison functions designed to be as efficient as possible while avoiding overflow. This commit makes use of these functions in many of the in-tree qsort() comparators to help ensure transitivity. Many of these comparator functions should also see a small performance boost.

Author: Mats Kindahl
Reviewed-by: Andres Freund, Fabrízio de Royes Mello
Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
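A sketch of the comparator pattern (pg_cmp_u32() is one of the real helpers from common/int.h; the comparator itself is illustrative):

    #include "common/int.h"

    static int
    xid_cmp(const void *a, const void *b)
    {
        /* Returning "a - b" directly could overflow and break transitivity. */
        return pg_cmp_u32(*(const TransactionId *) a,
                          *(const TransactionId *) b);
    }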
2024-01-29  Remove ReorderBufferTupleBuf structure.  (Masahiko Sawada)

Since commit a4ccc1cef, the 'node' and 'alloc_tuple_size' fields of the ReorderBufferTupleBuf structure are no longer used. This leaves only the 'tuple' field in the structure. Since keeping a single-field structure makes little sense, the ReorderBufferTupleBuf is removed entirely. The code is refactored accordingly.

No back-patching since these are ABI changes in an exposed structure and functions, and there would be some risk of breaking extensions.

Author: Aleksander Alekseev
Reviewed-by: Amit Kapila, Masahiko Sawada, Reid Thompson
Discussion: https://postgr.es/m/CAD21AoCvnuxiXXfRecp7g9+CeC35POQfhuQeJFr7_9u_Q5jc_Q@mail.gmail.com
2024-01-04  Update copyright for 2024  (Bruce Momjian)

Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12
2023-08-30  Fix some shadow variables in src/backend/replication/  (Michael Paquier)

The code is able to compile already without warnings under -Wshadow=compatible-local, which is itself already enabled in the tree, and the ones fixed here showed up with the more restrictive -Wshadow. There are more of these that we may want to look at, and the ones fixed here made the code confusing.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+PuR0y4ofNOxi691VTVWmBfScHV9AaBMGSpeh8+DKp81Nw@mail.gmail.com
2023-08-29  Rename logical_replication_mode to debug_logical_replication_streaming  (Peter Eisentraut)

The logical_replication_mode GUC is intended for testing and debugging purposes, but its current name may be misleading and encourage users to make unnecessary changes. To avoid confusion, rename the GUC to the less misleading name debug_logical_replication_streaming, which casual users are less likely to mistakenly assume needs to be modified in a regular logical replication setup.

Author: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/d672d774-c44b-6fec-f993-793e744f169a%40eisentraut.org
2023-08-02  Fix ReorderBufferCheckMemoryLimit() comment.  (Masahiko Sawada)

Commit 7259736a6 updated the comment but it was not correct, since ReorderBufferLargestStreamableTopTXN() returns only top-level transactions.

Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAD21AoA9XB7OR86BqvrCe2dMYX%2BZv3-BvVmjF%3DGY2z6jN-kqjg%40mail.gmail.com
Backpatch-through: 14
2023-05-19  Pre-beta mechanical code beautification.  (Tom Lane)

Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-02  Fix typos in comments  (Michael Paquier)

The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
2023-03-17  Add macros for ReorderBufferTXN toptxn.  (Amit Kapila)

Currently, there are quite a few places in reorderbuffer.c that try to access the top-transaction for a subtransaction. This commit makes the code that accesses the top-transaction consistent and easier to follow.

Author: Peter Smith
Reviewed-by: Vignesh C, Sawada Masahiko
Discussion: https://postgr.es/m/CAHut+PuCznOyTqBQwjRUu-ibG-=KHyCv-0FTcWQtZUdR88umfg@mail.gmail.com
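A before/after sketch (the macro names reflect my reading of the commit; treat them as assumptions):

    ReorderBufferTXN *toptxn;

    /* Before: open-coded at each call site. */
    toptxn = (txn->toptxn != NULL) ? txn->toptxn : txn;

    /* After: the intent is explicit. */
    toptxn = rbtxn_get_toptxn(txn);
    Assert(!rbtxn_is_subtxn(toptxn));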
2023-02-22  Fix snapshot handling in logicalmsg_decode  (Tomas Vondra)

When decoding a transactional logical message, logicalmsg_decode called SnapBuildGetOrBuildSnapshot. But we may not have a consistent snapshot yet at that point. We don't actually need the snapshot in this case (during replay we'll have the snapshot from the transaction), so in practice this is harmless. But in an assert-enabled build this crashes.

Fixed by requesting the snapshot only in the non-transactional case, where we are guaranteed to have SNAPBUILD_CONSISTENT.

Backpatch to 11. The issue exists since 9.6.

Backpatch-through: 11
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/84d60912-6eab-9b84-5de3-41765a5449e8@enterprisedb.com
2023-02-08  Fix the logical replication timeout during large DDLs.  (Amit Kapila)

DDLs like REFRESH MATERIALIZED VIEW that generate lots of temporary data due to rewrite rules may not be processed by output plugins (for example pgoutput). So, we won't send keep-alive messages for a long time while processing such commands, and that can lead the subscriber side to time out. We have previously fixed a similar case for large transactions in commit f95d53eded, where the output plugin filters all or most of the changes, but missed handling the DDL case.

We decided not to backpatch this as it adds a new callback in the existing exposed structure, and moreover, users can increase wal_sender_timeout and wal_receiver_timeout to avoid this problem.

Author: Wang wei, Hou Zhijie
Reviewed-by: Peter Smith, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/OS3PR01MB6275478E5D29E4A563302D3D9E2B9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAA5-nLARN7-3SLU_QUxfy510pmrYK6JJb=bk3hcgemAM_pAv+w@mail.gmail.com
2023-02-06  Remove useless casts to (void *) in hash_search() calls  (Peter Eisentraut)

Some of these appear to be leftovers from when hash_search() took a char * argument (changed in 5999e78fc45dcb91784b64b6e9ae43f4e4f68ca2).

Since after this there is some more horizontal space available, do some light reformatting where suitable.

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
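The change in miniature (the hash table and key here are illustrative; hash_search()'s key parameter is const void *):

    /* Before: a cast left over from the char * era. */
    ent = hash_search(ht, (void *) &xid, HASH_FIND, NULL);

    /* After: any object pointer converts to const void * implicitly. */
    ent = hash_search(ht, &xid, HASH_FIND, NULL);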
2023-01-30  Rename GUC logical_decoding_mode to logical_replication_mode.  (Amit Kapila)

Rename the developer option 'logical_decoding_mode' to the more flexible name 'logical_replication_mode', because doing so will make it easier to extend this option in the future to help test other areas of logical replication.

Currently, it is used on the publisher side to allow streaming or serializing each change in logical decoding. In an upcoming patch, we are planning to use it on the subscriber. On the subscriber, it will allow serializing the changes to file and notify the parallel apply workers to read and apply them at the end of the transaction.

We discussed exposing this parameter as a subscription option but it did not seem advisable since it is primarily used for testing/debugging and there is no other such parameter. We also discussed having separate GUCs for publisher and subscriber, but for current testing/debugging requirements, one GUC is sufficient.

Author: Hou Zhijie
Reviewed-by: Peter Smith, Kuroda Hayato, Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoAy2c=Mx=FTCs+EwUsf2kQL5MmU3N18X84k0EmCXntK4g@mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-09  Perform apply of large transactions by parallel workers.  (Amit Kapila)

Currently, for large transactions, the publisher sends the data in multiple streams (changes divided into chunks depending upon logical_decoding_work_mem), and then on the subscriber side, the apply worker writes the changes into temporary files and, once it receives the commit, it reads from those files and applies the entire transaction. To improve the performance of such transactions, we can instead allow them to be applied via parallel workers.

In this approach, we assign a new parallel apply worker (if available) as soon as the xact's first stream is received, and the leader apply worker will send changes to this new worker via shared memory. The parallel apply worker will directly apply the change instead of writing it to temporary files. However, if the leader apply worker times out while attempting to send a message to the parallel apply worker, it will switch to "partial serialize" mode - in this mode, the leader serializes all remaining changes to a file and notifies the parallel apply workers to read and apply them at the end of the transaction. We use a non-blocking way to send the messages from the leader apply worker to the parallel apply worker to avoid deadlocks. We keep this parallel apply worker assigned till the transaction commit is received and also wait for the worker to finish at commit. This preserves commit ordering and avoids writing to and reading from files in most cases. We still need to spill if there is no worker available.

This patch also extends the SUBSCRIPTION 'streaming' parameter so that the user can control whether to apply the streaming transaction in a parallel apply worker or spill the change to disk. The user can set the streaming parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means the streaming will be applied via a parallel apply worker, if available. The parameter value 'on' means the streaming transaction will be spilled to disk. The default value is 'off' (same as current behaviour).

In addition, the patch extends the logical replication STREAM_ABORT message so that abort_lsn and abort_time can also be sent, which can be used to update the replication origin in the parallel apply worker when the streaming transaction is aborted. Because this message extension is needed to support parallel streaming, parallel streaming is not supported for publications on servers < PG16.

Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko
Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-07  Remove the streaming files for incomplete xacts after restart.  (Amit Kapila)

After restart, we try to stream the changes for large transactions that were not sent before the server crash and restart. However, we forget to send the abort message for such transactions. This leads to spurious streaming files on the subscriber, which won't be cleaned up till the apply worker or the subscriber server restarts.

Reported-by: Dilip Kumar
Author: Hou Zhijie
Reviewed-by: Dilip Kumar and Amit Kapila
Backpatch-through: 14
Discussion: https://postgr.es/m/OS0PR01MB5716A773F46768A1B75BE24394FB9@OS0PR01MB5716.jpnprd01.prod.outlook.com
2023-01-02  Update copyright for 2023  (Bruce Momjian)

Backpatch-through: 11
2022-12-26  Add 'logical_decoding_mode' GUC.  (Amit Kapila)

This enables streaming or serializing changes immediately in logical decoding. This parameter is intended to be used to test logical decoding and replication of large transactions, for which we would otherwise need to generate changes until logical_decoding_work_mem is reached. This helps in reducing the runtime of existing tests related to logical replication of in-progress transactions and will help in writing tests for the upcoming feature for applying large in-progress transactions in parallel.

Author: Shi yu
Reviewed-by: Sawada Masahiko, Shveta Mallik, Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi
Discussion: https://postgr.es/m/OSZPR01MB63104E7449DBE41932DB19F1FD1B9@OSZPR01MB6310.jpnprd01.prod.outlook.com
2022-12-08  Avoid unnecessary streaming of transactions during logical replication.  (Amit Kapila)

After restart, we don't perform streaming of an in-progress transaction if it was previously decoded and confirmed by the client. To achieve that, we were comparing the END location of the WAL record being decoded with the WAL location we have already decoded and confirmed by the client. While decoding the commit record, to decide whether to process and send the complete transaction, we compare its START location with the WAL location we have already decoded and confirmed by the client. Now, if we need to queue some change in the transaction while decoding the commit record (e.g. a snapshot), it is possible that we decide to stream the transaction but later commit processing decides to skip it. In such a case, we would needlessly send the changes and, later, when we decide to skip it, we will send a stream abort.

We also sometimes decide to stream the changes when we actually just need to process them locally, like a change for invalidations. This will lead us to send empty streams. To avoid this, while queuing each change for decoding, we remember whether the transaction has any change that actually needs to be sent downstream and use that information later to decide whether to stream the transaction or not.

Note, we can't avoid all cases where we have to send empty streams, like the case where the plugin later decides that the change is not publishable. However, we will no longer need to send stream_abort when we skip sending a particular transaction.

Author: Dilip Kumar
Reviewed-by: Hou Zhijie, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/CAFiTN-tHK=7LzfrPs8fbT2ksrOJGQbzywcgXst2bM9-rJJAAUg@mail.gmail.com
2022-11-10  Fix comments atop ReorderBufferAddInvalidations.  (Amit Kapila)

The comments atop seem to indicate that we always accumulate invalidation messages in a top-level transaction, which is neither required nor matches the code.

Author: Amit Kapila
Reviewed-by: Masahiko Sawada
Backpatch-through: 14, where it was introduced in commit c55040ccd0
Discussion: https://postgr.es/m/CAA4eK1LxGgnUroPz8STb6OfjVU1yaHoSA+T63URwmGCLdMJ0LA@mail.gmail.com
2022-11-02  Add doubly linked count list implementation  (David Rowley)

We have various requirements when using a dlist_head to keep track of the number of items in the list. This, traditionally, has been done by maintaining a counter variable in the calling code. Here we tidy this up by adding "dclist", which is very similar to dlist but also keeps track of the number of items stored in the list. Callers may use the new dclist_count() function when they need to know how many items are stored. Obtaining the count is an O(1) operation.

For simplicity reasons, dclist and dlist both use dlist_node as their node type and dlist_iter/dlist_mutable_iter as their iterator type. dclists have all of the same functionality as dlists, except there is no function named dclist_delete(). To remove an item from a list, dclist_delete_from() must be used. This requires knowing which dclist the given item is stored in.

Additionally, here we also convert some dlists where additional code exists to keep track of the number of items stored, and make these use dclists instead.

Author: David Rowley
Reviewed-by: Bharath Rupireddy, Aleksander Alekseev
Discussion: https://postgr.es/m/CAApHDvrtVxr+FXEX0VbViCFKDGxA3tWDgw9oFewNXCJMmwLjLg@mail.gmail.com
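A sketch of the dclist API described above (the list and node here are illustrative; the functions are the real interface):

    dclist_head changes;

    dclist_init(&changes);
    dclist_push_tail(&changes, &change->node);      /* count maintained */

    Assert(dclist_count(&changes) == 1);            /* O(1), no counter var */
    dclist_delete_from(&changes, &change->node);    /* must name the list */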