| Age | Commit message (Collapse) | Author |
|
We already do this in CreateParallelContext, InitializeParallelDSM, and
LaunchParallelWorkers. I suspect the reason why the matching logic was
omitted from ReinitializeParallelDSM is that I failed to realize that
any memory allocation was happening here -- but shm_mq_attach does
allocate, which could result in a shm_mq_handle being allocated in a
shorter-lived context than the ParallelContext which points to it.
That could result in a crash if the shorter-lived context is freed
before the parallel context is destroyed. As far as I am currently
aware, there is no way to reach a crash using only code that is
present in core PostgreSQL, but extensions could potentially trip
over this. Fixing this in the back-branches appears low-risk, so
back-patch to all supported versions.
Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Backpatch-through: 14
Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com
|
|
This commit provides test coverage for dc7c77f825d7, where the redo
record and the checkpoint record finish on different WAL segments with
the start of recovery able to detect that the redo record is missing.
This test uses a wait injection point done in the critical section of a
checkpoint, method that requires not one but actually two wait injection
points to avoid any memory allocations within the critical section of
the checkpoint:
- Checkpoint run with a background psql.
- One first wait point is run by the checkpointer before the critical
section, allocating the shared memory required by the DSM registry for
the wait machinery in the library injection_points.
- First point is woken up.
- Second wait point is loaded before the critical section, allocating
the memory to build the path to the library loaded, then run in the
critical section once the checkpoint redo record has been logged.
- WAL segment is switched while waiting on the second point.
- Checkpoint completes.
- Stop cluster with immediate mode.
- The segment that includes the redo record is removed.
- Start, recovery fails as the redo record cannot be found.
The error message introduced in dc7c77f825d7 is now reduced to a FATAL,
meaning that the information is still provided while being able to use a
test for it. Nitin has provided a basic version of the test, that I
have enhanced to make it portable with two points. Without
dc7c77f825d7, the cluster crashes in this test, not on a PANIC but due
to the pointer dereference at the beginning of recovery, failure
mentioned in the other commit.
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
|
|
This commit adds an extra check at the beginning of recovery to ensure
that the redo record of a checkpoint exists before attempting WAL
replay, logging a PANIC if the redo record referenced by the checkpoint
record could not be found. This is the same level of failure as when a
checkpoint record is missing. This check is added when a cluster is
started without a backup_label, after retrieving its checkpoint record.
The redo LSN used for the check is retrieved from the checkpoint record
successfully read.
In the case where a backup_label exists, the startup process already
fails if the redo record cannot be found after reading a checkpoint
record at the beginning of recovery.
Previously, the presence of the redo record was not checked. If the
redo and checkpoint records were located on different WAL segments, it
would be possible to miss a entire range of WAL records that should have
been replayed but were just ignored. The consequences of missing the
redo record depend on the version dealt with, these becoming worse the
older the version used:
- On HEAD, v18 and v17, recovery fails with a pointer dereference at the
beginning of the redo loop, as the redo record is expected but cannot be
found. These versions are good students, because we detect a failure
before doing anything, even if the failure is misleading in the shape of
a segmentation fault, giving no information that the redo record is
missing.
- In v16 and v15, problems show at the end of recovery within
FinishWalRecovery(), the startup process using a buggy LSN to decide
from where to start writing WAL. The cluster gets corrupted, still it
is noisy about it.
- v14 and older versions are worse: a cluster gets corrupted but it is
entirely silent about the matter. The redo record missing causes the
startup process to skip entirely recovery, because a missing record is
the same as not redo being required at all. This leads to data loss, as
everything is missed between the redo record and the checkpoint record.
Note that I have tested that down to 9.4, reproducing the issue with a
version of the author's reproducer slightly modified. The code is wrong
since at least 9.2, but I did not look at the exact point of origin.
This problem has been found by debugging a cluster where the WAL segment
including the redo segment was missing due to an operator error, leading
to a crash, based on an investigation in v15.
Requesting archive recovery with the creation of a recovery.signal or
a standby.signal even without a backup_label would mitigate the issue:
if the record cannot be found in pg_wal/, the missing segment can be
retrieved with a restore_command when checking that the redo record
exists. This was already the case without this commit, where recovery
would re-fetch the WAL segment that includes the redo record. The check
introduced by this commit makes the segment to be retrieved earlier to
make sure that the redo record can be found.
On HEAD, the code will be slightly changed in a follow-up commit to not
rely on a PANIC, to include a test able to emulate the original problem.
This is a minimal backpatchable fix, kept separated for clarity.
Reported-by: Andres Freund <andres@anarazel.de>
Analyzed-by: Andres Freund <andres@anarazel.de>
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
Backpatch-through: 14
|
|
In the server, check explicitly for multixids with zero members. We
used to have an assertion for it, but commit d4b7bde418 replaced it
with more extensive runtime checks, but it missed the original case of
zero members.
In the upgrade code, a negative length never makes sense, so better
check for it explicitly. Commit d4b7bde418 added a similar sanity
check to the corresponding server code on master, and in backbranches,
the 'length' is passed to palloc which would fail with "invalid memory
alloc request size" error. Clarify the comments on what kind of
invalid entries are tolerated by the upgrade code and which ones are
reported as fatal errors.
Coverity complained about 'length' in the upgrade code being
tainted. That's bogus because we trust the data on disk at least to
some extent, but hopefully this will silence the complaint. If not,
I'll dismiss it manually.
Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi
|
|
The current list from the buildfarm includes quite a few typedef
names that it used to miss. The reason is a bit obscure, but it
seems likely to have something to do with our recent increased
use of palloc_object and palloc_array. In any case, this makes
the relevant struct declarations be much more nicely formatted,
so I'll take it. Install the current list and re-run pgindent
to update affected code.
Syncing with the current list also removes some obsolete
typedef names and fixes some alphabetization errors.
Discussion: https://postgr.es/m/1681301.1765742268@sss.pgh.pa.us
|
|
Change WAIT_LSN_TYPE_COUNT from an enum sentinel to a macro definition,
in a similar way to IOObject, IOContext, and BackendType enums. Remove
explicit enum value assignments well.
Author: Xuneng Zhou <xunengzhou@gmail.com>
|
|
Similar to commit 75f49221c22, it is preferable to use
StaticAssertDecl() instead of StaticAssertStmt() when possible.
Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com
|
|
Before this commit, when multixid wraparound happens,
MultiXactState->nextMXact goes to 0, which is invalid. All the readers
need to deal with that possibility and skip over the 0. That's
error-prone and we've missed it a few times in the past. This commit
changes the responsibility so that all the writers of
MultiXactState->nextMXact skip over the zero already, and readers can
trust that it's never 0.
We were already doing that for MultiXactState->oldestMultiXactId; none
of its writers would set it to 0. ReadMultiXactIdRange() was
nevertheless checking for that possibility. For clarity, remove that
check.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
|
|
It's not far-fetched that we'd try to read a multixid with an invalid
offset in case of bugs or corruption. Or if you call
pg_get_multixact_members() after a crash that left behind invalid but
unused multixids. Better to get a somewhat descriptive error message
if that happens.
Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
|
|
Oversight in commit bd8d9c9b.
Discussion: https://postgr.es/m/CAH2-WzmvwVKZ+0Z=RL_+g_aOku8QxWddDCXmtyLj02y+nYaD0g@mail.gmail.com
|
|
The idea is to encourage more the use of these new routines across the
tree, as these offer stronger type safety guarantees than palloc().
This batch of changes includes most of the trivial changes suggested by
the author for src/backend/.
A total of 334 files are updated here. Among these files, 48 of them
have their build change slightly; these are caused by line number
changes as the new allocation formulas are simpler, shaving around 100
lines of code in total.
Similar work has been done in 0c3c5c3b06a3 and 31d3847a37be.
Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
|
|
Author: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BFpmFf-hWXtrC0Q3Cr_Xo78zuP_M_VC5xgWPOYOkwqOD0T8eg@mail.gmail.com
|
|
This eliminates MultiXactOffset wraparound and the 2^32 limit on the
total number of multixid members. Multixids are still limited to 2^31,
but this is a nice improvement because 'members' can grow much faster
than the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.
Not having to deal with MultiXactOffset wraparound also simplifies the
code and removes some gnarly corner cases.
We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion transactions and 4 billion
transactions. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.
Add code to pg_upgrade to convert multitransactions from the old to
the new format, rewriting the pg_multixact SLRU files. Because
pg_upgrade now rewrites the files, we can get rid of some hacks we had
put in place to deal with old bugs and upgraded clusters. Bump catalog
version for the pg_multixact/offsets format change.
Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
|
|
This makes them accessible from pg_upgrade, needed by the next commit.
I'm doing this mechanical move as a separate commit to make the next
commit's changes to these definitions more obvious.
Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
|
|
There were a number of useless casts in format arguments, either
where the input to the cast was already in the right type, or
seemingly uselessly casting between types instead of just using the
right format placeholder to begin with.
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
|
|
The code in BootStrapXLOG() and in pg_test_fsync.c tried to align WAL
buffers in complicated ways. Also, they still used XLOG_BLCKSZ for
the alignment, even though that should now be PG_IO_ALIGN_SIZE. This
can now be simplified and made more consistent by using
PGAlignedXLogBlock, either directly in BootStrapXLOG() and using
alignas in pg_test_fsync.c.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f462a175-b608-44a1-b428-bdf351e914f4%40eisentraut.org
|
|
In commit 789d65364c, we started updating the next multixid's offset
too when recording a multixid, so that it can always be used to
calculate the number of members. I got it wrong at offset wraparound:
we need to skip over offset 0. Fix that.
Discussion: https://www.postgresql.org/message-id/d9996478-389a-4340-8735-bfad456b313c@iki.fi
Backpatch-through: 14
|
|
The assertion checking MyProcNumber used MaxBackends as the upper
bound, but the procInfos array is allocated with size
MaxBackends + NUM_AUXILIARY_PROCS. This inconsistency would cause
a false assertion failure if an auxiliary process calls WaitForLSN().
Author: Xuneng Zhou <xunengzhou@gmail.com>
|
|
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
|
|
This removes some casts where the input already has the same type as
the type specified by the cast. Their presence could cause risks of
hiding actual type mismatches in the future or silently discarding
qualifiers. It also improves readability. Same kind of idea as
7f798aca1d5 and ef8fe693606. (This does not change all such
instances, but only those hand-picked by the author.)
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
|
|
The input is ASCII anyway, so it's better to be clear that it's not
locale-dependent.
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
|
|
This split exists for most of the other RMGRs, and makes cleaner the
separation between the WAL code, the redo code and the record
description code (already in its own file) when it comes to the sequence
RMGR. The redo and masking routines are moved to a new file,
sequence_xlog.c. All the RMGR routines are now located in a new header,
sequence_xlog.h.
This separation is useful for a different patch related to sequences
that I have been working on, where it makes a refactoring of sequence.c
easier if its RMGR routines and its core routines are split.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/aSfTxIWjiXkTKh1E@paquier.xyz
|
|
We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time. However, one process
can wait only for one type of LSN at a time. So, no need for inHeap
and heapNode fields to be arrays.
Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
|
|
WaitLSNWakeup() incorrectly returned early when called with
InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path
check compared minWaitedLSN > 0 without validating currentLSN first.
This caused WAIT FOR LSN commands to wait indefinitely during standby
promotion until random signals woke them.
Add an XLogRecPtrIsValid() check before the comparison so that
InvalidXLogRecPtr bypasses the fast-path and wakes all waiters immediately.
Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
|
|
PostgreSQL's Windows port has never been able to handle files larger
than 2GB due to the use of off_t for file offsets, only 32-bit on
Windows. This causes signed integer overflow at exactly 2^31 bytes when
trying to handle files larger than 2GB, for the routines touched by this
commit.
Note that large files are forbidden by ./configure (3c6248a828af) and
meson (recent change, see 79cd66f28c65). This restriction also exists
in v16 and older versions for the now-dead MSVC scripts.
The code base already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
internals still rely on off_t. This commit switches more routines to
use pgoff_t, offering more portability, for areas mainly related to file
extensions and storage.
These are not critical for WAL segments yet, which have currently a
maximum size allowed of 1GB (well, this opens the door at allowing a
larger size for them). This matters more for segment files if we want
to lift the large file restriction in ./configure and meson in the
future, which would make sense to remove once/if all traces of off_t are
gone from the tree. This can additionally matter for out-of-core code
that may want files larger than 2GB in places where off_t is four bytes
in size.
Note that off_t is still used in other parts of the tree like
buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc.
These other code paths can be addressed separately, and their update
will be required if we want to remove the large file restriction in the
future. This commit is a good first cut in itself towards more
portability, hopefully.
On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.
Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com
|
|
It seems plausible that someone might want to experiment with
different values. The pressing reason though is that I'm reviewing a
patch that requires pg_upgrade to manipulate SLRU files. That patch
needs to access SLRU_PAGES_PER_SEGMENT from pg_upgrade code, and
slru.h, where SLRU_PAGES_PER_SEGMENT is currently defined, cannot be
included from frontend code. Moving it to pg_config_manual.h makes it
accessible.
Now that it's a little more likely that someone might change
SLRU_PAGES_PER_SEGMENT, add a cluster compatibility check for it.
Bump catalog version because of the new field in the control file.
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/c7a4ea90-9f7b-4953-81be-b3fcb47db057@iki.fi
|
|
We only need to do it for WAIT_LSN_TYPE_REPLAY. WAIT_LSN_TYPE_FLUSH can work
for both primary and follower.
|
|
Now that commit 06edbed47862 has introduced XLogRecPtrIsValid(), we can
use that instead of:
- XLogRecPtrIsInvalid()
- direct comparisons with InvalidXLogRecPtr
- direct comparisons with literal 0
This makes the code more consistent.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal
|
|
Various places that were using StringInfo but didn't need that
StringInfo to exist beyond the scope of the function were using
makeStringInfo(), which allocates both a StringInfoData and the buffer it
uses as two separate allocations. It's more efficient for these cases to
use a StringInfoData on the stack and initialize it with initStringInfo(),
which only allocates the string buffer. This also simplifies the cleanup,
in a few cases.
Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379aac8-26f1-42f2-a356-ff0e886228d3@gmail.com
|
|
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why separate utility command seems appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitation in this aspect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
|
|
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
|
|
These assertions may prove to become useful to make sure that no process
other than the startup process calls the routines where these checks are
added, as we expect that these do not interfere with a WAL receiver
switched to a "stopping" state by a startup process.
The assumption that only the startup process can use this code has
existed for many years, without a check enforcing it.
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/aQmGeVLYl51y1m_0@paquier.xyz
|
|
When the startup process switches from streaming to archive as WAL
source, we avoid calling ShutdownWalRcv() if the WAL receiver is not
streaming, based on WalRcvStreaming(). WALRCV_STOPPING is a state set
by ShutdownWalRcv(), called only by the startup process, meaning that it
should not be possible to reach this state while in
WaitForWALToBecomeAvailable().
This commit adds an assertion to make sure that a WAL receiver is never
in a WALRCV_STOPPING state should the startup process attempt to reset
InstallXLogFileSegmentActive.
Idea suggested by Noah Misch.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
|
|
Commit b4f584f9d2a1 (affecting v15~, later backpatched down to 13 as of
3635a0a35aaf) introduced an unconditional WAL receiver shutdown when
switching from streaming to archive WAL sources. This causes problems
during a timeline switch, when a WAL receiver enters WALRCV_WAITING
state but remains alive, waiting for instructions.
The unconditional shutdown can break some monitoring scenarios as the
WAL receiver gets repeatedly terminated and re-spawned, causing
pg_stat_wal_receiver.status to show a "streaming" instead of "waiting"
status, masking the fact that the WAL receiver is waiting for a new TLI
and a new LSN to be able to continue streaming.
This commit changes the WAL receiver behavior so as the shutdown becomes
conditional, with InstallXLogFileSegmentActive being always reset to
prevent the regression fixed by b4f584f9d2a1: only terminate the WAL
receiver when it is actively streaming (WALRCV_STREAMING,
WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state,
just reset InstallXLogFileSegmentActive flag to allow archive
restoration without killing the process. WALRCV_STOPPED and
WALRCV_STOPPING are not reachable states in this code path. For the
latter, the startup process is the one in charge of setting
WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to
reach a WALRCV_STOPPED state after switching walRcvState, so
WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is
in a WALRCV_STOPPING state.
A regression test is added to check that a WAL receiver is not stopped
on timeline jump, that fails when the fix of this commit is reverted.
Reported-by: Ryan Bird <ryanzxg@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Backpatch-through: 13
|
|
XLogRecordAssemble() may be called multiple times before inserting a
record in XLogInsertRecord(), and the amount of FPIs generated inside
a record whose insertion is attempted multiple times may vary.
The logic added in f9a09aa29520 touched directly pgWalUsage in
XLogRecordAssemble(), meaning that it could be possible for pgWalUsage
to be incremented multiple times for a single record. This commit
changes the code to use the same logic as the number of FPIs added to a
record, where XLogRecordAssemble() returns this information and feeds it
to XLogInsertRecord(), updating pgWalUsage only when a record is
inserted.
Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com
|
|
This new counter, called "wal_fpi_bytes", tracks the total amount in
bytes of full page images (FPIs) generated in WAL. This data becomes
available globally via pg_stat_wal, and for backend statistics via
pg_stat_get_backend_wal().
Previously, this information could only be retrieved with pg_waldump or
pg_walinspect, which may not be available depending on the environment,
and are expensive to execute. It offers hints about how much FPIs
impact the WAL generated, which could be a large percentage for some
workloads, as well as the effects of wal_compression or page holes.
Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in
PgStat_WalCounters.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
|
|
Previously, if primary_slot_name was set to an invalid slot name and
the configuration file was reloaded, both the postmaster and all other
backend processes reported a WARNING. With many processes running,
this could produce a flood of duplicate messages. The problem was that
the GUC check hook for primary_slot_name reported errors at WARNING
level via ereport().
This commit changes the check hook to use GUC_check_errdetail() and
GUC_check_errhint() for error reporting. As with other GUC parameters,
this causes non-postmaster processes to log the message at DEBUG3,
so by default, only the postmaster's message appears in the log file.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com
Backpatch-through: 13
|
|
These functions appeared prominently in a profile of a patch that sets
the visibility map on-access. Inline them to remove call overhead and
make them cheaper to use in hot paths.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
|
|
Issue found while browsing this area of the code, introduced and
copy-pasted around by 2258e76f90bf.
Backpatch-through: 15
|
|
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().
The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.
Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
|
|
Reported-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+TgmoZYh_nw-2j_Fi9y6ZAvrpN+W1aSOFNM7Rus2Q-zTkCsQw@mail.gmail.com
|
|
During recovery, XLogNeedsFlush() checks the minimum recovery LSN point
instead of the flush LSN point. The same condition checks are used when
updating the minimum recovery point in UpdateMinRecoveryPoint(), but are
written in reverse order.
This commit makes the order of the checks consistent between
XLogNeedsFlush() and UpdateMinRecoveryPoint(), improving the code
clarity. Note that the second check (as ordered by this commit) relies
on InRecovery, which is true only in the startup process. So this makes
XLogNeedsFlush() cheaper in the startup process with the first check
acting as a shortcut while doing crash recovery, where
LocalMinRecoveryPoint is an invalid LSN.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aMIHNRTP6Wj6vw1s%40paquier.xyz
|
|
When defining an injection point there is no need to wrap the definition
with USE_INJECTION_POINT guards, the INJECTION_POINT macro is available
in all builds. Remove to make the code consistent.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/OSCPR01MB14966C8015DEB05ABEF2CE077F51FA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
|
|
... which silently propagates a lot of headers into many places
via pgstat.h, as evidenced by the variety of headers that this patch
needs to add to seemingly random places. Add a minimum of typedefs to
conflict.h to be able to remove execnodes.h, and fix the fallout.
Backpatch to 18, where conflict.h first appeared.
Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql
|
|
This doesn't provide any value over the standard style of checking the
pointer directly or comparing against NULL.
Also remove related:
- AllocPointerIsValid() [unused]
- IndexScanIsValid() [had one user]
- HeapScanIsValid() [unused]
- InvalidRelation [unused]
Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(),
RelationIsValid for now, to reduce code churn.
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi
|
|
When deciding which code path to use depending on the state of recovery,
XLogFlush() and XLogNeedsFlush() have been relying on different
criterias:
- XLogFlush() relied on XLogInsertAllowed().
- XLogNeedsFlush() relied on RecoveryInProgress().
Currently, the checkpointer is allowed to insert WAL records while
RecoveryInProgress() returns true for an end-of-recovery checkpoint,
where XLogInsertAllowed() matters. Using RecoveryInProgress() in
XLogNeedsFlush() did not really matter for its existing callers, as the
checkpointer only called XLogFlush(). However, a feature under
discussion, by Melanie Plageman, needs XLogNeedsFlush() to be able to
work in more contexts, the end-of-recovery checkpoint being one.
This commit changes XLogNeedsFlush() to use XLogInsertAllowed() instead
of RecoveryInProgress(), making the checks in both routines more
consistent. While on it, an assertion based on XLogNeedsFlush() is
added at the end of XLogFlush(), triggered when flushing a physical
position (not for the normal recovery patch that checks for updates of
the minimum recovery point). This assertion would fail for example in
the recovery test 015_promotion_pages if XLogNeedsFlush() is changed to
use RecoveryInProgress(). This should be hopefully enough to ensure
that the checks done in both routines remain consistent.
Author: Melanie Plageman <melanieplageman@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com
|
|
The startup process does not process shared invalidation messages, only
sending them, and never calls AtEOXact_SMgr() which clean up any
unpinned SMgrRelations. Hence, it is never able to free SMgrRelations
on a periodic basis, bloating its hashtable over time.
Like the checkpointer and the bgwriter, this commit takes a conservative
approach by freeing periodically SMgrRelations when replaying a
checkpoint record, either online or shutdown, so as the startup process
has a way to perform a periodic cleanup.
Issue caused by 21d9c3ee4ef7, so backpatch down to v17.
Author: Jingtang Zhang <mrdrivingduck@gmail.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com
Backpatch-through: 17
|
|
A test has been added to ensure that conflict-relevant data is not
prematurely removed when a concurrent prepared transaction is being
committed on the publisher.
This test introduces an injection point that simulates the presence of a
prepared transaction in the commit phase, validating that the system
correctly delays conflict slot advancement until the transaction is fully
committed.
Additionally, the test serves as a safeguard for developers, ensuring that
the acquisition of the commit timestamp does not occur before marking
DELAY_CHKPT_IN_COMMIT in RecordTransactionCommitPrepared.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
|
|
This commit fixes three issues:
1) When a disabled subscription is created with retain_dead_tuples set to true,
the launcher is not woken up immediately, which may lead to delays in creating
the conflict detection slot.
Creating the conflict detection slot is essential even when the subscription is
not enabled. This ensures that dead tuples are retained, which is necessary for
accurately identifying the type of conflict during replication.
2) Conflict-related data was unnecessarily retained when the subscription does
not have a table.
3) Conflict-relevant data could be prematurely removed before applying
prepared transactions on the publisher that are in the commit critical section.
This issue occurred because the backend executing COMMIT PREPARED was not
accounted for during the computation of oldestXid in the commit phase on
the publisher. As a result, the subscriber could advance the conflict
slot's xmin without waiting for such COMMIT PREPARED transactions to
complete.
We fixed this issue by identifying prepared transactions that are in the
commit critical section during computation of oldestXid in commit phase.
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913DACB64E5721872AA5C02943BA@OS9PR01MB16913.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
|
|
SlruRecentlyUsed() is an inline function since 53c2a97a9266, not a
macro. The description of long_segment_names was missing at the top of
SimpleLruInit(), part forgotten in 4ed8f0913bfd.
Author: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/aLpBLMOYwEQkaleF@jrouhaud
Backpatch-through: 17
|