| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- mempool_alloc_bulk() support for upcoming users in the block layer
that need to allocate multiple objects at once with the mempool's
guaranteed progress semantics, which is not achievable with an
allocation single objects in a loop. Along with refactoring and
various improvements (Christoph Hellwig)
- Preparations for the upcoming separation of struct slab from struct
page, mostly by removing the struct folio layer, as the purpose of
struct folio has shifted since it became used in slab code (Matthew
Wilcox)
- Modernisation of slab's boot param API usage, which removes some
unexpected parsing corner cases (Petr Tesarik)
- Refactoring of freelist_aba_t (now struct freelist_counters) and
associated functions for double cmpxchg, enabled by -fms-extensions
(Vlastimil Babka)
- Cleanups and improvements related to sheaves caching layer, that were
part of the full conversion to sheaves, which is planned for the next
release (Vlastimil Babka)
* tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (42 commits)
slab: Remove unnecessary call to compound_head() in alloc_from_pcs()
mempool: clarify behavior of mempool_alloc_preallocated()
mempool: drop the file name in the top of file comment
mempool: de-typedef
mempool: remove mempool_{init,create}_kvmalloc_pool
mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool
mempool: add mempool_{alloc,free}_bulk
mempool: factor out a mempool_alloc_from_pool helper
slab: Remove references to folios from virt_to_slab()
kasan: Remove references to folio in __kasan_mempool_poison_object()
memcg: Convert mem_cgroup_from_obj_folio() to mem_cgroup_from_obj_slab()
mempool: factor out a mempool_adjust_gfp helper
mempool: add error injection support
mempool: improve kerneldoc comments
mm: improve kerneldoc comments for __alloc_pages_bulk
fault-inject: make enum fault_flags available unconditionally
usercopy: Remove folio references from check_heap_object()
slab: Remove folio references from kfree_nolock()
slab: Remove folio references from kfree_rcu_sheaf()
slab: Remove folio references from build_detached_freelist()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Improve the granularity of SELinux labeling for memfd files
Currently when creating a memfd file, SELinux treats it the same as
any other tmpfs, or hugetlbfs, file. While simple, the drawback is
that it is not possible to differentiate between memfd and tmpfs
files.
This adds a call to the security_inode_init_security_anon() LSM hook
and wires up SELinux to provide a set of memfd specific access
controls, including the ability to control the execution of memfds.
As usual, the commit message has more information.
- Improve the SELinux AVC lookup performance
Adopt MurmurHash3 for the SELinux AVC hash function instead of the
custom hash function currently used. MurmurHash3 is already used for
the SELinux access vector table so the impact to the code is minimal,
and performance tests have shown improvements in both hash
distribution and latency.
See the commit message for the performance measurments.
- Introduce a Kconfig option for the SELinux AVC bucket/slot size
While we have the ability to grow the number of AVC hash buckets
today, the size of the buckets (slot size) is fixed at 512. This pull
request makes that slot size configurable at build time through a new
Kconfig knob, CONFIG_SECURITY_SELINUX_AVC_HASH_BITS.
* tag 'selinux-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: improve bucket distribution uniformity of avc_hash()
selinux: Move avtab_hash() to a shared location for future reuse
selinux: Introduce a new config to make avc cache slot size adjustable
memfd,selinux: call security_inode_init_security_anon()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"These are the arm64 updates for 6.19.
The biggest part is the Arm MPAM driver under drivers/resctrl/.
There's a patch touching mm/ to handle spurious faults for huge pmd
(similar to the pte version). The corresponding arm64 part allows us
to avoid the TLB maintenance if a (huge) page is reused after a write
fault. There's EFI refactoring to allow runtime services with
preemption enabled and the rest is the usual perf/PMU updates and
several cleanups/typos.
Summary:
Core features:
- Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
driver under drivers/resctrl/ which makes use of the fs/rectrl/ API
Perf and PMU:
- Avoid cycle counter on multi-threaded CPUs
- Extend CSPMU device probing and add additional filtering support
for NVIDIA implementations
- Add support for the PMUs on the NoC S3 interconnect
- Add additional compatible strings for new Cortex and C1 CPUs
- Add support for data source filtering to the SPE driver
- Add support for i.MX8QM and "DB" PMU in the imx PMU driver
Memory managemennt:
- Avoid broadcast TLBI if page reused in write fault
- Elide TLB invalidation if the old PTE was not valid
- Drop redundant cpu_set_*_tcr_t0sz() macros
- Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
- Propagate return value from __change_memory_common()
ACPI and EFI:
- Call EFI runtime services without disabling preemption
- Remove unused ACPI function
Miscellaneous:
- ptrace support to disable streaming on SME-only systems
- Improve sysreg generation to include a 'Prefix' descriptor
- Replace __ASSEMBLY__ with __ASSEMBLER__
- Align register dumps in the kselftest zt-test
- Remove some no longer used macros/functions
- Various spelling corrections"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
arm64/pageattr: Propagate return value from __change_memory_common
arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
KVM: arm64: selftests: Consider all 7 possible levels of cache
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
Documentation/arm64: Fix the typo of register names
ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
perf: arm_spe: Add support for filtering on data source
perf: Add perf_event_attr::config4
perf/imx_ddr: Add support for PMU in DB (system interconnects)
perf/imx_ddr: Get and enable optional clks
perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
arm64: mm: use untagged address to calculate page index
MAINTAINERS: new entry for MPAM Driver
arm_mpam: Add kunit tests for props_mismatch()
arm_mpam: Add kunit test for bitmap reset
arm_mpam: Add helper to reset saved mbwu state
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Provide a new interface for dynamic configuration and deconfiguration
of hotplug memory, allowing with and without memmap_on_memory
support. This makes the way memory hotplug is handled on s390 much
more similar to other architectures
- Remove compat support. There shouldn't be any compat user space
around anymore, therefore get rid of a lot of code which also doesn't
need to be tested anymore
- Add stackprotector support. GCC 16 will get new compiler options,
which allow to generate code required for kernel stackprotector
support
- Merge pai_crypto and pai_ext PMU drivers into a new driver. This
removes a lot of duplicated code. The new driver is also extendable
and allows to support new PMUs
- Add driver override support for AP queues
- Rework and extend zcrypt and AP trace events to allow for tracing of
crypto requests
- Support block sizes larger than 65535 bytes for CCW tape devices
- Since the rework of the virtual kernel address space the module area
and the kernel image are within the same 4GB area. This eliminates
the need of weak per cpu variables. Get rid of
ARCH_MODULE_NEEDS_WEAK_PER_CPU
- Various other small improvements and fixes
* tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits)
watchdog: diag288_wdt: Remove KMSG_COMPONENT macro
s390/entry: Use lay instead of aghik
s390/vdso: Get rid of -m64 flag handling
s390/vdso: Rename vdso64 to vdso
s390: Rename head64.S to head.S
s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros
s390: Add stackprotector support
s390/modules: Simplify module_finalize() slightly
s390: Remove KMSG_COMPONENT macro
s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
s390/ap: Restrict driver_override versus apmask and aqmask use
s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex
s390/ap: Support driver_override for AP queue devices
s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks
s390/debug: Update description of resize operation
s390/syscalls: Switch to generic system call table generation
s390/syscalls: Remove system call table pointer from thread_struct
s390/uapi: Remove 31 bit support from uapi header files
s390: Remove compat support
tools: Remove s390 compat support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:
fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
if (fd < 0)
vfio_device_put_registration(device);
return fd;
FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:
FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
if (fdf.err) {
fput(sync_file->file);
return fdf.err;
}
data.fence = fd_prepare_fd(fdf);
if (copy_to_user((void __user *)arg, &data, sizeof(data)))
return -EFAULT;
return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basica guard infrastructure"
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduce uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-26-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-25-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"8 hotfixes. 4 are cc:stable, 7 are against mm/.
All are singletons - please see the respective changelogs for details"
* tag 'mm-hotfixes-stable-2025-11-26-11-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/filemap: fix logic around SIGBUS in filemap_map_pages()
mm/huge_memory: fix NULL pointer deference when splitting folio
MAINTAINERS: add test_kho to KHO's entry
mailmap: add entry for Sam Protsenko
selftests/mm: fix division-by-zero in uffd-unit-tests
mm/mmap_lock: reset maple state on lock_vma_under_rcu() retry
mm/memfd: fix information leak in hugetlb folios
mm: swap: remove duplicate nr_swap_pages decrement in get_swap_page_of_type()
|
|
Merges series "mempool_alloc_bulk and various mempool improvements v3"
from Christoph Hellwig.
From the cover letter [1]:
This series adds a bulk version of mempool_alloc that makes allocating
multiple objects deadlock safe.
The initial users is the blk-crypto-fallback code:
https://lore.kernel.org/linux-block/20251031093517.1603379-1-hch@lst.de/
with which v1 was posted, but I also have a few other users in mind.
Link: https://lore.kernel.org/all/20251113084022.1255121-1-hch@lst.de/ [1]
|
|
Merge series "slab: cmpxchg cleanups enabled by -fms-extensions"
From the cover letter [1]:
After learning about -fms-extensions being enabled for 6.19, I realized
there is some cleanup potential in slub code by extending the definition
and usage of freelist_aba_t, as it can now become an unnamed member of
struct slab. This series performs the cleanup, with no functional
changes intended. Additionally we turn freelist_aba_t to struct
freelist_counters as it doesn't meet any criteria for being a typedef,
per Documentation/process/coding-style.rst
Based on the tag kbuild-ms-extensions-6.19 from
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linuxV
Link: https://lore.kernel.org/all/20251107-slab-fms-cleanup-v1-0-650b1491ac9e@suse.cz/#t [1]
|
|
Merge series "Prepare slab for memdescs" by Matthew Wilcox.
From the cover letter [1]:
When we separate struct folio, struct page and struct slab from each
other, converting to folios then to slabs will be nonsense. It made
sense under the 'folio is just a head page' interpretation, but with
full separation, page_folio() will return NULL for a page which belongs
to a slab.
This patch series removes almost all mentions of folio from slab.
There are a few folio_test_slab() invocations left around the tree that
I haven't decided how to handle yet. We're not yet quite at the point
of separately allocating struct slab, but that's what I'll be working
on next.
Link: https://lore.kernel.org/all/20251113000932.1589073-1-willy@infradead.org/ [1]
|
|
Merge series "slab: preparatory cleanups before adding sheaves to all
caches" [1]
Cleanups that were written as part of the full sheaves conversion, which
is not fully ready yet, but they are useful on their own.
Link: https://lore.kernel.org/all/20251105-sheaves-cleanups-v1-0-b8218e1ac7ef@suse.cz/ [1]
|
|
Each page knows which node it belongs to, so there's no need to
convert to a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251124142329.1691780-1-willy@infradead.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
inode_add_lru(). move it up
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
(inode_lru_list_del()) and similar routines for sb and io list
management
4. push list presence check into inode_lru_list_del(), just like sb and
io list
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In a recent commit, I inadvertently changed a comparison from being an
unsigned comparison (on 64-bit systems) to being a signed comparison
(which it had always been on 32-bit systems). This led to a sporadic
fstests failure.
To make sure this comparison is always unsigned, introduce a new type,
uoff_t which is the unsigned version of loff_t. Generally file sizes
are restricted to being a signed integer, but in these two places it is
convenient to pass -1 to indicate "up to the end of the file".
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251123220518.1447261-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Chris noticed that filemap_map_pages() calculates can_map_large only once
for the first page in the fault around range. The value is not valid for
the following pages in the range and must be recalculated.
Instead of recalculating can_map_large on each iteration, pass down
file_end to filemap_map_folio_range() and let it make the decision on what
can be mapped.
Link: https://lkml.kernel.org/r/20251120161411.859078-1-kirill@shutemov.name
Fixes: 74207de2ba10 ("mm/memory: do not populate page table entries beyond i_size")h
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit c010d47f107f ("mm: thp: split huge page to any lower order pages")
introduced an early check on the folio's order via mapping->flags before
proceeding with the split work.
This check introduced a bug: for shmem folios in the swap cache and
truncated folios, the mapping pointer can be NULL. Accessing
mapping->flags in this state leads directly to a NULL pointer dereference.
This commit fixes the issue by moving the check for mapping != NULL before
any attempt to access mapping->flags.
Link: https://lkml.kernel.org/r/20251119235302.24773-1-richard.weiyang@gmail.com
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The retry in lock_vma_under_rcu() drops the rcu read lock before
reacquiring the lock and trying again. This may cause a use-after-free if
the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is
dropped, the state cannot be reused as it tracks pointers to objects that
may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be
invalidated. Resetting the address and state to MA_START is the safest
course of action, which will result in the next operation starting from
the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU
lock on failure"), vma_start_read() would drop rcu read lock and return
NULL, so the retry would not have happened. However, now that
vma_start_read() drops rcu read lock on failure followed by a retry, we
may end up using a freed maple tree node cached in the maple state.
[surenb@google.com: changelog alteration]
Link: https://lkml.kernel.org/r/CAJuCfpEWMD-Z1j=nPYHcQW4F7E2Wka09KTXzGv7VE7oW1S8hcw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251111215605.1721380-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure")
Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When allocating hugetlb folios for memfd, three initialization steps are
missing:
1. Folios are not zeroed, leading to kernel memory disclosure to userspace
2. Folios are not marked uptodate before adding to page cache
3. hugetlb_fault_mutex is not taken before hugetlb_add_to_page_cache()
The memfd allocation path bypasses the normal page fault handler
(hugetlb_no_page) which would handle all of these initialization steps.
This is problematic especially for udmabuf use cases where folios are
pinned and directly accessed by userspace via DMA.
Fix by matching the initialization pattern used in hugetlb_no_page():
- Zero the folio using folio_zero_user() which is optimized for huge pages
- Mark it uptodate with folio_mark_uptodate()
- Take hugetlb_fault_mutex before adding to page cache to prevent races
The folio_zero_user() change also fixes a potential security issue where
uninitialized kernel memory could be disclosed to userspace through read()
or mmap() operations on the memfd.
Link: https://lkml.kernel.org/r/20251112145034.2320452-1-kartikey406@gmail.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20251112031631.2315651-1-kartikey406@gmail.com/ [v1]
Closes: https://syzkaller.appspot.com/bug?extid=f64019ba229e3a5c411b
Suggested-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com> (v2)
Cc: Christoph Hellwig <hch@lst.de> (v6)
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After commit 4f78252da887, nr_swap_pages is decremented in
swap_range_alloc(). Since cluster_alloc_swap_entry() calls
swap_range_alloc() internally, the decrement in get_swap_page_of_type()
causes double-decrementing.
As a representative userspace-visible runtime example of the impact,
/proc/meminfo reports increasingly inaccurate SwapFree values. The
discrepancy grows with each swap allocation, and during hibernation
when large amounts of memory are written to swap, the reported value
can deviate significantly from actual available swap space, misleading
users and monitoring tools.
Remove the duplicate decrement.
Link: https://lkml.kernel.org/r/20251102082456.79807-1-youngjun.park@lge.com
Fixes: 4f78252da887 ("mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The documentation of that function promises to never sleep. However on
PREEMPT_RT a spinlock_t might in fact sleep.
Reword the documentation so users can predict its behavior better.
mempool could also replace spinlock_t with raw_spinlock_t which doesn't
sleep even on PREEMPT_RT but that would take away the improved
preemptibility of sleeping locks.
Link: https://lkml.kernel.org/r/20251014-mempool-doc-v1-1-bc9ebf169700@linutronix.de
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Mentioning the name of the file is redundant, so drop it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-12-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Switch all uses of the deprecated mempool_t typedef in the core mempool
code to use struct mempool instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-11-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
This was added for bcachefs and is unused now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-10-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The timeout here is and old workaround with a Fixme comment. But
thinking about it, it makes sense to keep it, so reword the comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-9-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add a version of the mempool allocator that works for batch allocations
of multiple objects. Calling mempool_alloc in a loop is not safe because
it could deadlock if multiple threads are performing such an allocation
at the same time.
As an extra benefit the interface is build so that the same array can be
used for alloc_pages_bulk / release_pages so that at least for page
backed mempools the fast path can use a nice batch optimization.
Note that mempool_alloc_bulk does not take a gfp_mask argument as it
must always be able to sleep and doesn't support any non-trivial
modifiers. NOFO or NOIO constrainst must be set through the scoped API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-8-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add a helper for the mempool_alloc slowpath to better separate it from the
fast path, and also use it to implement mempool_alloc_preallocated which
shares the same logic.
[hughd@google.com: fix lack of retrying with __GFP_DIRECT_RECLAIM]
[vbabka@suse.cz: really use limited flags for first mempool attempt]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-7-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- Fix mempool poisoning order>0 pages with CONFIG_HIGHMEM (Vlastimil Babka)
* tag 'slab-for-6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/mempool: fix poisoning order>0 pages with HIGHMEM
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"Fix memblock_estimated_nr_free_pages() for soft-reserved memory
The "soft-reserved" memory regions (EFI_MEMORY_SP) are added to the
memblock.reserved, but not to the memblock.memory. It causes
memblock_estimated_nr_free_pages() to return a value smaller value
than expected, or if it underflows, an extremely large value.
Calculate the number of estimated free pages using
memblock_reserved_kern_size() instead of memblock_reserved_size() to
fix the issue"
* tag 'fixes-2025-11-19' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: fix memblock_estimated_nr_free_pages() for soft-reserved memory
|
|
The page faults may be spurious because of the racy access to the page
table. For example, a non-populated virtual page is accessed on 2
CPUs simultaneously, thus the page faults are triggered on both CPUs.
However, it's possible that one CPU (say CPU A) cannot find the reason
for the page fault if the other CPU (say CPU B) has changed the page
table before the PTE is checked on CPU A. Most of the time, the
spurious page faults can be ignored safely. However, if the page
fault is for the write access, it's possible that a stale read-only
TLB entry exists in the local CPU and needs to be flushed on some
architectures. This is called the spurious page fault fixing.
In the current kernel, there is spurious fault fixing support for pte,
but not for huge pmd because no architectures need it. But in the
next patch in the series, we will change the write protection fault
handling logic on arm64, so that some stale huge pmd entries may
remain in the TLB. These entries need to be flushed via the huge pmd
spurious fault fixing mechanism.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The recent fix to properly initialize the tags of the huge zero folio
had an unfortunate not-so-subtle side effect: it caused the actual
*contents* of the huge zero folio to not be initialized at all when the
hardware didn't support the memory tagging.
The reason was the unfortunate semantics of tag_clear_highpage(): on
hardware that didn't do the tagging, it would silently just not do
anything at all. And since this is done only on arm64 with MTE support,
that basically meant most hardware.
It wasn't necessarily immediately obvious since the huge zero page isn't
necessarily very heavily used - or because it might already be zero
because all-zeroes is the most common pattern. But it ends up causing
random odd user space failures when you do hit it.
The unfortunate semantics have been around for a while, but became a
real bug only when we started actively using __GFP_ZEROTAGS in the
generic get_huge_zero_folio() function - before that, it had only ever
been used in code that checked that the hardware supported it.
Fix this by simply changing the semantics of tag_clear_highpage() to
return whether it actually successfully did something or not. While at
it, also make it initialize multiple pages in one go, since that's
actually what the only caller wants it to do and it simplifies the whole
logic.
Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio")
Link: https://lore.kernel.org/all/20251117082023.90176-1-00107082@163.com/
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-and-tested-by: David Wang <00107082@163.com>
Reported-and-tested-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix unitialized variable in statmount_string()
- Fix hostfs mounting when passing host root during boot
- Fix dynamic lookup to fail on cell lookup failure
- Fix missing file type when reading bfs inodes from disk
- Enforce checking of sb_min_blocksize() calls and update all callers
accordingly
- Restore write access before closing files opened by open_exec() in
binfmt_misc
- Always freeze efivarfs during suspend/hibernate cycles
- Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
actually allow mount namespace file descriptor in addition to mount
namespace ids
- Fix tmpfs remount when noswap is specified
- Switch Landlock to iput_not_last() to remove false-positives from
might_sleep() annotations in iput()
- Remove dead node_to_mnt_ns() code
- Ensure that per-queue kobjects are successfully created
* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
landlock: fix splats from iput() after it started calling might_sleep()
fs: add iput_not_last()
shmem: fix tmpfs reconfiguration (remount) when noswap is set
fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
power: always freeze efivarfs
binfmt_misc: restore write access before closing files opened by open_exec()
block: add __must_check attribute to sb_min_blocksize()
virtio-fs: fix incorrect check for fsvq->kobj
xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
isofs: check the return value of sb_min_blocksize() in isofs_fill_super
exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
vfat: fix missing sb_min_blocksize() return value checks
mnt: Remove dead code which might prevent from building
bfs: Reconstruct file type when loading from disk
afs: Fix dynamic lookup to fail on cell lookup failure
hostfs: Fix only passing host root in boot stage with new mount
fs: Fix uninitialized 'offp' in statmount_string()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 hotfixes. 5 are cc:stable, 4 are against mm/
All are singletons - please see the respective changelogs for details"
* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm, swap: fix potential UAF issue for VMA readahead
selftests/user_events: fix type cast for write_index packed member in perf_test
lib/test_kho: check if KHO is enabled
mm/huge_memory: fix folio split check for anon folios in swapcache
MAINTAINERS: update David Hildenbrand's email address
crash: fix crashkernel resource shrink
mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
|
|
Since commit 78524b05f1a3 ("mm, swap: avoid redundant swap device
pinning"), the common helper for allocating and preparing a folio in the
swap cache layer no longer tries to get a swap device reference
internally, because all callers of __read_swap_cache_async are already
holding a swap entry reference. The repeated swap device pinning isn't
needed on the same swap device.
Caller of VMA readahead is also holding a reference to the target entry's
swap device, but VMA readahead walks the page table, so it might encounter
swap entries from other devices, and call __read_swap_cache_async on
another device without holding a reference to it.
So it is possible to cause a UAF when swapoff of device A raced with
swapin on device B, and VMA readahead tries to read swap entries from
device A. It's not easy to trigger, but in theory, it could cause real
issues.
Make VMA readahead try to get the device reference first if the swap
device is a different one from the target entry.
Link: https://lkml.kernel.org/r/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com
Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both uniform and non uniform split check missed the check to prevent
splitting anon folios in swapcache to non-zero order.
Splitting anon folios in swapcache to non-zero order can cause data
corruption since swapcache only support PMD order and order-0 entries.
This can happen when one use split_huge_pages under debugfs to split
anon folios in swapcache.
In-tree callers do not perform such an illegal operation. Only debugfs
interface could trigger it. I will put adding a test case on my TODO
list.
Fix the check.
Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com
Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: "David Hildenbrand (Red Hat)" <david@kernel.org>
Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support
runtime allocation of gigantic hugetlb folios. In the meantime it evolved
into a generic way for the architecture to state that it supports gigantic
hugetlb folios.
In commit fae7d834c43c ("mm: add __dump_folio()") we started using
CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could
have folios larger than what the buddy can handle. In the context of that
commit, we started using MAX_FOLIO_ORDER to detect page corruptions when
dumping tail pages of folios. Before that commit, we assumed that we
cannot have folios larger than the highest buddy order, which was
obviously wrong.
In commit 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes
when registering hstate"), we used MAX_FOLIO_ORDER to detect
inconsistencies, and in fact, we found some now.
Powerpc allows for configs that can allocate gigantic folio during boot
(not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can
exceed PUD_ORDER.
To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE with
hugetlb on powerpc, and increase the maximum folio size with hugetlb to 16
GiB on 64bit (possible on arm64 and powerpc) and 1 GiB on 32 bit
(powerpc). Note that on some powerpc configurations, whether we actually
have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER,
but there is nothing really problematic about setting it unconditionally:
we just try to keep the value small so we can better detect problems in
__dump_folio() and inconsistencies around the expected largest folio in
the system.
Ideally, we'd have a better way to obtain the maximum hugetlb folio size
and detect ourselves whether we really end up with gigantic folios. Let's
defer bigger changes and fix the warnings first.
While at it, handle gigantic DAX folios more clearly: DAX can only end up
creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.
Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer.
In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE.
Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now
also allow for runtime allocations of folios in some more powerpc configs.
I don't think this is a problem, but if it is we could handle it through
__HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED.
While __dump_page()/__dump_folio was also problematic (not handling
dumping of tail pages of such gigantic folios correctly), it doesn't seem
critical enough to mark it as a fix.
Link: https://lkml.kernel.org/r/20251114214920.2550676-1-david@kernel.org
Fixes: 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Closes: https://lore.kernel.org/r/3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu/
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/r/94377f5c-d4f0-4c0f-b0f6-5bf1cd7305b1@linux.ibm.com/
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel test has reported:
BUG: unable to handle page fault for address: fffba000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pde = 03171067 *pte = 00000000
Oops: Oops: 0002 [#1]
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.18.0-rc2-00031-gec7f31b2a2d3 #1 NONE a1d066dfe789f54bc7645c7989957d2bdee593ca
Tainted: [T]=RANDSTRUCT
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
EIP: memset (arch/x86/include/asm/string_32.h:168 arch/x86/lib/memcpy_32.c:17)
Code: a5 8b 4d f4 83 e1 03 74 02 f3 a4 83 c4 04 5e 5f 5d 2e e9 73 41 01 00 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 89 c6 89 d0 89 f7 <f3> aa 89 f0 5e 5f 5d 2e e9 53 41 01 00 cc cc cc 55 89 e5 53 57 56
EAX: 0000006b EBX: 00000015 ECX: 001fefff EDX: 0000006b
ESI: fffb9000 EDI: fffba000 EBP: c611fbf0 ESP: c611fbe8
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010287
CR0: 80050033 CR2: fffba000 CR3: 0316e000 CR4: 00040690
Call Trace:
poison_element (mm/mempool.c:83 mm/mempool.c:102)
mempool_init_node (mm/mempool.c:142 mm/mempool.c:226)
mempool_init_noprof (mm/mempool.c:250 (discriminator 1))
? mempool_alloc_pages (mm/mempool.c:640)
bio_integrity_initfn (block/bio-integrity.c:483 (discriminator 8))
? mempool_alloc_pages (mm/mempool.c:640)
do_one_initcall (init/main.c:1283)
Christoph found out this is due to the poisoning code not dealing
properly with CONFIG_HIGHMEM because only the first page is mapped but
then the whole potentially high-order page is accessed.
We could give up on HIGHMEM here, but it's straightforward to fix this
with a loop that's mapping, poisoning or checking and unmapping
individual pages.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511111411.9ebfa1ba-lkp@intel.com
Analyzed-by: Christoph Hellwig <hch@lst.de>
Fixes: bdfedb76f4f5 ("mm, mempool: poison elements backed by slab allocator")
Cc: stable@vger.kernel.org
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113-mempool-poison-v1-1-233b3ef984c3@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- Fix memory leak of objects from remote NUMA node when bulk freeing to
a cache with sheaves (Harry Yoo)
* tag 'slab-for-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: fix memory leak in free_to_pcs_bulk()
|
|
Use page_slab() instead of virt_to_folio() which will work
perfectly when struct slab is separated from struct folio.
This was the last user of folio_slab(), so delete it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-17-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In preparation for splitting struct slab from struct page and struct
folio, remove mentions of struct folio from this function. There is a
mild improvement for large kmalloc objects as we will avoid calling
compound_head() for them. We can discard the comment as using
PageLargeKmalloc() rather than !folio_test_slab() makes it obvious.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Link: https://patch.msgid.link/20251113000932.1589073-16-willy@infradead.org
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In preparation for splitting struct slab from struct page and struct
folio, convert the pointer to a slab rather than a folio. This means
we can end up passing a NULL slab pointer to mem_cgroup_from_obj_slab()
if the pointer is not to a page allocated to slab, and we handle that
appropriately by returning NULL.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: cgroups@vger.kernel.org
Link: https://patch.msgid.link/20251113000932.1589073-15-willy@infradead.org
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The commit 989b09b73978 ("slab: skip percpu sheaves for remote object
freeing") introduced the remote_objects array in free_to_pcs_bulk() to
skip sheaves when objects from a remote node are freed.
However, the array is flushed only when:
1) the array becomes full (++remote_nr >= PCS_BATCH_MAX), or
2) slab_free_hook() returns false and size becomes zero.
When neither of the conditions is met, objects in the array are leaked.
This resulted in a memory leak [1], where 82 GiB of memory was allocated
for the maple_node cache.
Flush the array after successfully freeing objects to sheaves
in the do_free: path.
In the meantime, move the snippet if (!size) goto flush_remote; outside
the while loop for readability. Let's say all objects in the array are
from a remote node: then we acquire s->cpu_sheaves->lock and try to free
an object even when size is zero. This doesn't appear to be harmful,
but isn't really readable.
Reported-by: Tytus Rogalewski <admin@simplepod.ai>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220765 [1]
Closes: https://lore.kernel.org/linux-mm/20251107094809.12e9d705b7bf4815783eb184@linux-foundation.org
Closes: https://lore.kernel.org/all/aRGDTwbt2EIz2CYn@hyeyoo
Fixes: 989b09b73978 ("slab: skip percpu sheaves for remote object freeing")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251111125331.12246-1-harry.yoo@oracle.com
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Tytus Rogalewski <admin@simplepod.ai>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add a helper to better isolate and document the gfp flags adjustments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-6-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add a call to should_fail_ex that forces mempool to actually allocate
from the pool to stress the mempool implementation when enabled through
debugfs. By default should_fail{,_ex} prints a very verbose stack trace
that clutters the kernel log, slows down execution and triggers the
kernel bug detection in xfstests. Pass FAULT_NOWARN and print a
single-line message notating the caller instead so that full tests
can be run with fault injection.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20251113084022.1255121-5-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Use proper formatting, use full sentences and reduce some verbosity in
function parameter descriptions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-4-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Describe the semantincs in more detail, as the filling empty slots in
an array scheme is not quite obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-3-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Use page_slab() instead of virt_to_folio() followed by folio_slab().
We do end up calling compound_head() twice for non-slab copies, but that
will not be a problem once we allocate memdescs separately.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: linux-hardening@vger.kernel.org
Link: https://patch.msgid.link/20251113000932.1589073-14-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In preparation for splitting struct slab from struct page and struct
folio, remove mentions of struct folio from this function. Since large
kmalloc objects are not supported here, we can just use virt_to_slab().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-13-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In preparation for splitting struct slab from struct page and struct
folio, remove mentions of struct folio from this function. Since
we don't need to handle large kmalloc objects specially here, we
can just use virt_to_slab().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-12-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Use pages and slabs directly instead of converting to folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-11-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
One slight tweak I made is to calculate 'ks' earlier, which means we
can reuse it in the warning rather than calculating the object size twice.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-10-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
This should generate identical code to the previous version, but
without any dependency on how folios work.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-9-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Remove conversions from folio to page and folio to slab. This is
preparation for separately allocated struct slab from struct page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-8-willy@infradead.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
There's no need to use folio APIs here; just use a page directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251113000932.1589073-7-willy@infradead.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
There's no need to use folio APIs here; just use a page directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251113000932.1589073-6-willy@infradead.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Use pages directly to further the split between slab and folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251113000932.1589073-5-willy@infradead.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
This allows us to skip the compound_head() call for large kmalloc
objects as the virt_to_page() call will always give us the head page
for the large kmalloc case.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251113000932.1589073-4-willy@infradead.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In the future, we will separate slab, folio and page from each other
and calling virt_to_folio() on an address allocated from slab will
return NULL. Delay the conversion from struct page to struct slab
until we know we're not dealing with a large kmalloc allocation.
There's a minor win for large kmalloc allocations as we avoid the
compound_head() hidden in virt_to_folio().
This deprecates calling ksize() on memory allocated by alloc_pages().
Today it becomes a warning and support will be removed entirely in
the future.
Introduce large_kmalloc_size() to abstract how we represent the size
of a large kmalloc allocation. For now, this is the same as
page_size(), but it will change with separately allocated memdescs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-3-willy@infradead.org
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In order to separate slabs from folios, we need to convert from any page
in a slab to the slab directly without going through a page to folio
conversion first.
Up to this point, page_slab() has followed the example of other memdesc
converters (page_folio(), page_ptdesc() etc) and just cast the pointer
to the requested type, regardless of whether the pointer is actually a
pointer to the correct type or not.
That changes with this commit; we check that the page actually belongs
to a slab and return NULL if it does not. Other memdesc converters will
adopt this convention in future.
kfence was the only user of page_slab(), so adjust it to the new way
of working. It will need to be touched again when we separate slab
from page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: kasan-dev@googlegroups.com
Link: https://patch.msgid.link/20251113000932.1589073-2-willy@infradead.org
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In barn_shrink(), use LIST_HEAD() to declare and initialize the
list_head in one step instead of using INIT_LIST_HEAD() separately.
No functional change.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In functions such as [__]slab_update_freelist() and
__slab_update_freelist_fast/slow() we pass old and new freelist and
counters as 4 separate parameters. The underlying
__update_freelist_fast() then constructs struct freelist_counters
variables for passing the full freelist+counter combinations to cmpxchg
double.
In most cases we actually start with struct freelist_counters variables,
but then pass the individual fields, only to construct new struct
freelist_counters variables. While it's all inlined and thus should be
efficient, we can simplify this code.
Thus replace the 4 parameters for individual fields with two pointers to
struct freelist_counters wherever applicable. __update_freelist_fast()
can then pass them directly to try_cmpxchg_freelist().
The code is also more obvious as the pattern becomes unified such that
we set up "old" and "new" struct freelist_counters variables upfront as
we fully need them to be, and simply call [__]slab_update_freelist() on
them. Previously some of the "new" values would be hidden among the
many parameters and thus make it harder to figure out what the code
does.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In systemd we're trying to switch the internal credentials setup logic
to new mount API [1], and I noticed fsconfig(FSCONFIG_CMD_RECONFIGURE)
consistently fails on tmpfs with noswap option. This can be trivially
reproduced with the following:
```
int fs_fd = fsopen("tmpfs", 0);
fsconfig(fs_fd, FSCONFIG_SET_FLAG, "noswap", NULL, 0);
fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsmount(fs_fd, 0, 0);
fsconfig(fs_fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); <------ EINVAL
```
After some digging the culprit is shmem_reconfigure() rejecting
!(ctx->seen & SHMEM_SEEN_NOSWAP) && sbinfo->noswap, which is bogus
as ctx->seen serves as a mask for whether certain options are touched
at all. On top of that, noswap option doesn't use fsparam_flag_no,
hence it's not really possible to "reenable" swap to begin with.
Drop the check and redundant SHMEM_SEEN_NOSWAP flag.
[1] https://github.com/systemd/systemd/pull/39637
Fixes: 2c6efe9cf2d7 ("shmem: add support to ignore swap")
Signed-off-by: Mike Yuan <me@yhndnzj.com>
Link: https://patch.msgid.link/20251108190930.440685-1-me@yhndnzj.com
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
memblock_estimated_nr_free_pages() returns the difference between the total
size of the "memory" memblock type and the "reserved" memblock type.
The "soft-reserved" memory regions are added to the "reserved" memblock
type, but not to the "memory" memblock type. Therefore,
memblock_estimated_nr_free_pages() may return a smaller value than
expected, or if it underflows, an extremely large value.
/proc/sys/kernel/threads-max is determined by the value of
memblock_estimated_nr_free_pages(). This issue was discovered on machines
with CXL memory because kernel.threads-max was either smaller than expected
or extremely large for the installed DRAM size.
This fixes the issue by replacing memblock_reserved_size() with
memblock_reserved_kern_size() that tells how much memory was
reserved from the actual RAM.
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://patch.msgid.link/20251111010010.7800-1-akinobu.mita@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
In several functions we declare local struct slab variables so we can
work with the freelist and counters fields (including the sub-counters
that are in the union) comfortably.
With struct freelist_counters containing the full counters definition,
we can now reduce the local variables to that type as we don't need the
other fields in struct slab.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In struct slab we currently have freelist and counters pair, where
counters itself is a union of unsigned long with a sub-struct of
several smaller fields. Then for the usage with double cmpxchg we have
freelist_aba_t that duplicates the definition of the freelist+counters
with implicitly the same layout as the full definition in struct slab.
Thanks to -fms-extension we can now move the full counters definition to
freelist_aba_t (while changing it to struct freelist_counters as a
typedef is unnecessary and discouraged) and replace the relevant part in
struct slab to an unnamed reference to it.
The immediate benefit is the removal of duplication and no longer
relying on the same layout implicitly. It also allows further cleanups
thanks to having the full definition of counters in struct
freelist_counters.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In kmem_cache_cpu we currently have a union of the freelist+tid pair
with freelist_aba_t, relying implicitly on the type compatibility with the
freelist+counters pair used in freelist_aba_t.
To allow further changes to freelist_aba_t, we can instead define a
separate struct freelist_tid (instead of a typedef, per the coding
style) for kmem_cache_cpu, as that affects only a single helper
__update_cpu_freelist_fast().
We can add the resulting struct freelist_tid to kmem_cache_cpu as
unnamed field thanks to -fms-extensions, so that freelist and tid fields
can still be accessed directly.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When a page fault occurs in a secret memory file created with
`memfd_secret(2)`, the kernel will allocate a new folio for it, mark the
underlying page as not-present in the direct map, and add it to the file
mapping.
If two tasks cause a fault in the same page concurrently, both could end
up allocating a folio and removing the page from the direct map, but only
one would succeed in adding the folio to the file mapping. The task that
failed undoes the effects of its attempt by (a) freeing the folio again
and (b) putting the page back into the direct map. However, by doing
these two operations in this order, the page becomes available to the
allocator again before it is placed back in the direct mapping.
If another task attempts to allocate the page between (a) and (b), and the
kernel tries to access it via the direct map, it would result in a
supervisor not-present page fault.
Fix the ordering to restore the direct map before the folio is freed.
Link: https://lkml.kernel.org/r/20251031120955.92116-1-lance.yang@linux.dev
Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas")
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com>
Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On arm64 with MTE enabled, a page mapped as Normal Tagged (PROT_MTE) in
user space will need to have its allocation tags initialised. This is
normally done in the arm64 set_pte_at() after checking the memory
attributes. Such page is also marked with the PG_mte_tagged flag to avoid
subsequent clearing. Since this relies on having a struct page,
pte_special() mappings are ignored.
Commit d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero
folio special") maps the huge zero folio special and the arm64
set_pmd_at() will no longer zero the tags. There is no guarantee that the
tags are zero, especially if parts of this huge page have been previously
tagged.
It's fairly easy to detect this by regularly dropping the caches to
force the reallocation of the huge zero folio.
Allocate the huge zero folio with the __GFP_ZEROTAGS flag. In addition,
do not warn in the arm64 __access_remote_tags() when reading tags from the
huge zero page.
I bundled the arm64 change in here as well since they are both related to
the commit mapping the huge zero folio as special.
[catalin.marinas@arm.com: handle arch mte_zero_clear_page_tags() code issuing MTE instructions]
Link: https://lkml.kernel.org/r/aQi8dA_QpXM8XqrE@arm.com
Link: https://lkml.kernel.org/r/20251031170133.280742-1-catalin.marinas@arm.com
Fixes: d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Beleswar Padhi <b-padhi@ti.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In DAMON's damon_sysfs_repeat_call_fn(), time_before() is used to compare
the current jiffies with next_update_jiffies to determine whether to
update the sysfs files at this moment.
On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make
jiffies wrap bugs appear earlier. However, this causes time_before() in
damon_sysfs_repeat_call_fn() to unexpectedly return true during the first
5 minutes after boot on 32-bit systems (see [1] for more explanation,
which fixes another jiffies-related issue before). As a result, DAMON
does not update sysfs files during that period.
There is also an issue unrelated to the system's word size[2]: if the
user stops DAMON just after next_update_jiffies is updated and restarts
it after 'refresh_ms' or a longer delay, next_update_jiffies will retain
an older value, causing time_before() to return false and the update to
happen earlier than expected.
Fix these issues by making next_update_jiffies a global variable and
initializing it each time DAMON is started.
Link: https://lkml.kernel.org/r/20251030020746.967174-3-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1]
Link: https://lore.kernel.org/all/20251029013038.66625-1-sj@kernel.org/ [2]
Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: fixes for the jiffies-related issues", v2.
On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make
jiffies wrap bugs appear earlier. However, this may cause the
time_before() series of functions to return unexpected values, resulting
in DAMON not functioning as intended. Meanwhile, similar issues exist in
some specific user operation scenarios.
This patchset addresses these issues. The first patch is about the
DAMON_STAT module, and the second patch is about the core layer's sysfs.
This patch (of 2):
In DAMON_STAT's damon_stat_damon_call_fn(), time_before_eq() is used to
avoid unnecessarily frequent stat update.
On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make
jiffies wrap bugs appear earlier. However, this causes time_before_eq()
in DAMON_STAT to unexpectedly return true during the first 5 minutes after
boot on 32-bit systems (see [1] for more explanation, which fixes another
jiffies-related issue before). As a result, DAMON_STAT does not update
any monitoring results during that period, which becomes more confusing
when DAMON_STAT_ENABLED_DEFAULT is enabled.
There is also an issue unrelated to the system's word size[2]: if the user
stops DAMON_STAT just after last_refresh_jiffies is updated and restarts
it after 5 seconds or a longer delay, last_refresh_jiffies will retain an
older value, causing time_before_eq() to return false and the update to
happen earlier than expected.
Fix these issues by making last_refresh_jiffies a global variable and
initializing it each time DAMON_STAT is started.
Link: https://lkml.kernel.org/r/20251030020746.967174-2-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1]
Link: https://lore.kernel.org/all/20251028143250.50144-1-sj@kernel.org/ [2]
Fixes: fabdd1e911da ("mm/damon/stat: calculate and expose estimated memory bandwidth")
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
slabobj_ext
When alloc_slab_obj_exts() fails and then later succeeds in allocating a
slab extension vector, it calls handle_failed_objexts_alloc() to mark all
objects in the vector as empty. As a result all objects in this slab
(slabA) will have their extensions set to CODETAG_EMPTY.
Later on if this slabA is used to allocate a slabobj_ext vector for
another slab (slabB), we end up with the slabB->obj_exts pointing to a
slabobj_ext vector that itself has a non-NULL slabobj_ext equal to
CODETAG_EMPTY. When slabB gets freed, free_slab_obj_exts() is called to
free slabB->obj_exts vector.
free_slab_obj_exts() calls mark_objexts_empty(slabB->obj_exts) which will
generate a warning because it expects slabobj_ext vectors to have a NULL
obj_ext, not CODETAG_EMPTY.
Modify mark_objexts_empty() to skip the warning and setting the obj_ext
value if it's already set to CODETAG_EMPTY.
To quickly detect this WARN, I modified the code from
WARN_ON(slab_exts[offs].ref.ct) to BUG_ON(slab_exts[offs].ref.ct == 1);
We then obtained this message:
[21630.898561] ------------[ cut here ]------------
[21630.898596] kernel BUG at mm/slub.c:2050!
[21630.898611] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[21630.900372] Modules linked in: squashfs isofs vfio_iommu_type1
vhost_vsock vfio vhost_net vmw_vsock_virtio_transport_common vhost tap
vhost_iotlb iommufd vsock binfmt_misc nfsv3 nfs_acl nfs lockd grace
netfs tls rds dns_resolver tun brd overlay ntfs3 exfat btrfs
blake2b_generic xor xor_neon raid6_pq loop sctp ip6_udp_tunnel
udp_tunnel nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
nf_tables rfkill ip_set sunrpc vfat fat joydev sg sch_fq_codel nfnetlink
virtio_gpu sr_mod cdrom drm_client_lib virtio_dma_buf drm_shmem_helper
drm_kms_helper drm ghash_ce backlight virtio_net virtio_blk virtio_scsi
net_failover virtio_console failover virtio_mmio dm_mirror
dm_region_hash dm_log dm_multipath dm_mod fuse i2c_dev virtio_pci
virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring autofs4
aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject]
[21630.909177] CPU: 3 UID: 0 PID: 3787 Comm: kylin-process-m Kdump:
loaded Tainted: G W 6.18.0-rc1+ #74 PREEMPT(voluntary)
[21630.910495] Tainted: [W]=WARN
[21630.910867] Hardware name: QEMU KVM Virtual Machine, BIOS unknown
2/2/2022
[21630.911625] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[21630.912392] pc : __free_slab+0x228/0x250
[21630.912868] lr : __free_slab+0x18c/0x250[21630.913334] sp :
ffff8000a02f73e0
[21630.913830] x29: ffff8000a02f73e0 x28: fffffdffc43fc800 x27:
ffff0000c0011c40
[21630.914677] x26: ffff0000c000cac0 x25: ffff00010fe5e5f0 x24:
ffff000102199b40
[21630.915469] x23: 0000000000000003 x22: 0000000000000003 x21:
ffff0000c0011c40
[21630.916259] x20: fffffdffc4086600 x19: fffffdffc43fc800 x18:
0000000000000000
[21630.917048] x17: 0000000000000000 x16: 0000000000000000 x15:
0000000000000000
[21630.917837] x14: 0000000000000000 x13: 0000000000000000 x12:
ffff70001405ee66
[21630.918640] x11: 1ffff0001405ee65 x10: ffff70001405ee65 x9 :
ffff800080a295dc
[21630.919442] x8 : ffff8000a02f7330 x7 : 0000000000000000 x6 :
0000000000003000
[21630.920232] x5 : 0000000024924925 x4 : 0000000000000001 x3 :
0000000000000007
[21630.921021] x2 : 0000000000001b40 x1 : 000000000000001f x0 :
0000000000000001
[21630.921810] Call trace:
[21630.922130] __free_slab+0x228/0x250 (P)
[21630.922669] free_slab+0x38/0x118
[21630.923079] free_to_partial_list+0x1d4/0x340
[21630.923591] __slab_free+0x24c/0x348
[21630.924024] ___cache_free+0xf0/0x110
[21630.924468] qlist_free_all+0x78/0x130
[21630.924922] kasan_quarantine_reduce+0x114/0x148
[21630.925525] __kasan_slab_alloc+0x7c/0xb0
[21630.926006] kmem_cache_alloc_noprof+0x164/0x5c8
[21630.926699] __alloc_object+0x44/0x1f8
[21630.927153] __create_object+0x34/0xc8
[21630.927604] kmemleak_alloc+0xb8/0xd8
[21630.928052] kmem_cache_alloc_noprof+0x368/0x5c8
[21630.928606] getname_flags.part.0+0xa4/0x610
[21630.929112] getname_flags+0x80/0xd8
[21630.929557] vfs_fstatat+0xc8/0xe0
[21630.929975] __do_sys_newfstatat+0xa0/0x100
[21630.930469] __arm64_sys_newfstatat+0x90/0xd8
[21630.931046] invoke_syscall+0xd4/0x258
[21630.931685] el0_svc_common.constprop.0+0xb4/0x240
[21630.932467] do_el0_svc+0x48/0x68
[21630.932972] el0_svc+0x40/0xe0
[21630.933472] el0t_64_sync_handler+0xa0/0xe8
[21630.934151] el0t_64_sync+0x1ac/0x1b0
[21630.934923] Code: aa1803e0 97ffef2b a9446bf9 17ffff9c (d4210000)
[21630.936461] SMP: stopping secondary CPUs
[21630.939550] Starting crashdump kernel...
[21630.940108] Bye!
Link: https://lkml.kernel.org/r/20251029014317.1533488-1-hao.ge@linux.dev
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: David Rientjes <rientjes@google.com>
Cc: gehao <gehao@kylinos.cn>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently mremap folio pte batch ignores the writable bit during figuring
out a set of similar ptes mapping the same folio. Suppose that the first
pte of the batch is writable while the others are not - set_ptes will end
up setting the writable bit on the other ptes, which is a violation of
mremap semantics. Therefore, use FPB_RESPECT_WRITE to check the writable
bit while determining the pte batch.
Link: https://lkml.kernel.org/r/20251028063952.90313-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
Reported-by: David Hildenbrand <david@redhat.com>
Debugged-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When emitting the order of the allocation for a hash table,
alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log
base 2 of the allocation size. This is not correct if the allocation size
is smaller than a page, and yields a negative value for the order as seen
below:
TCP established hash table entries: 32 (order: -4, 256 bytes, linear) TCP
bind hash table entries: 32 (order: -2, 1024 bytes, linear)
Use get_order() to compute the order when emitting the hash table
information to correctly handle cases where the allocation size is smaller
than a page:
TCP established hash table entries: 32 (order: 0, 256 bytes, linear) TCP
bind hash table entries: 32 (order: 0, 1024 bytes, linear)
Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.
This behavior might not be respected on truncation.
During truncation, the kernel splits a large folio in order to reclaim
memory. As a side effect, it unmaps the folio and destroys PMD mappings
of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
are preserved.
However, if the split fails, PMD mappings are preserved and the user will
not receive SIGBUS on any accesses within the PMD.
Unmap the folio on split failure. It will lead to refault as PTEs and
preserve SIGBUS semantics.
Make an exception for shmem/tmpfs that for long time intentionally mapped
with PMDs across i_size.
Link: https://lkml.kernel.org/r/20251027115636.82382-3-kirill@shutemov.name
Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios")
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Fix SIGBUS semantics with large folios", v3.
Accessing memory within a VMA, but beyond i_size rounded up to the next
page size, is supposed to generate SIGBUS.
Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
failed due to missing SIGBUS. This was caused by my recent changes that
try to fault in the whole folio where possible:
19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
357b92761d94 ("mm/filemap: map entire large folio faultaround")
These changes did not consider i_size when setting up PTEs, leading to
xfstest breakage.
However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016. The kernel happily maps
PMD-sized folios as PMD without checking i_size. And huge=always tmpfs
allocates PMD-size folios on any writes.
I considered this corner case when I implemented a large tmpfs, and my
conclusion was that no one in their right mind should rely on receiving a
SIGBUS signal when accessing beyond i_size. I cannot imagine how it could
be useful for the workload.
But apparently filesystem folks care a lot about preserving strict SIGBUS
semantics.
Generic/749 was introduced last year with reference to POSIX, but no real
workloads were mentioned. It also acknowledged the tmpfs deviation from
the test case.
POSIX indeed says[3]:
References within the address range starting at pa and
continuing for len bytes to whole pages following the end of an
object shall result in delivery of a SIGBUS signal.
The patchset fixes the regression introduced by recent changes as well as
more subtle SIGBUS breakage due to split failure on truncation.
This patch (of 2):
Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.
Recent changes attempted to fault in full folio where possible. They did
not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.
Darrick reported generic/749 breakage because of this.
However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to PMD-size allocation. Following the
fault-in of the folio will install PMD mapping regardless of i_size.
Fix filemap_map_pages() and finish_fault() to not install:
- PTEs beyond i_size;
- PMD mappings across i_size;
Make an exception for shmem/tmpfs that for long time intentionally
mapped with PMDs across i_size.
Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 6795801366da ("xfs: Support large folios")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio split clears PG_has_hwpoisoned, but the flag should be preserved in
after-split folios containing pages with PG_hwpoisoned flag if the folio
is split to >0 order folios. Scan all pages in a to-be-split folio to
determine which after-split folios need the flag.
An alternatives is to change PG_has_hwpoisoned to PG_maybe_hwpoisoned to
avoid the scan and set it on all after-split folios, but resulting false
positive has undesirable negative impact. To remove false positive,
caller of folio_test_has_hwpoisoned() and folio_contain_hwpoisoned_page()
needs to do the scan. That might be causing a hassle for current and
future callers and more costly than doing the scan in the split code.
More details are discussed in [1].
This issue can be exposed via:
1. splitting a has_hwpoisoned folio to >0 order from debugfs interface;
2. truncating part of a has_hwpoisoned folio in
truncate_inode_partial_folio().
And later accesses to a hwpoisoned page could be possible due to the
missing has_hwpoisoned folio flag. This will lead to MCE errors.
Link: https://lore.kernel.org/all/CAHbLzkoOZm0PXxE9qwtF4gKR=cpRXrSrJ9V9Pm2DJexs985q4g@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20251023030521.473097-1-ziy@nvidia.com
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Pankaj Raghav <kernel@pankajraghav.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, scan_get_next_rmap_item() walks every page address in a VMA to
locate mergeable pages. This becomes highly inefficient when scanning
large virtual memory areas that contain mostly unmapped regions, causing
ksmd to use large amount of cpu without deduplicating much pages.
This patch replaces the per-address lookup with a range walk using
walk_page_range(). The range walker allows KSM to skip over entire
unmapped holes in a VMA, avoiding unnecessary lookups. This problem was
previously discussed in [1].
Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;
int main() {
char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);
if (area == MAP_FAILED) {
perror("mmap() failed\n");
return -1;
}
/* Populate a single page such that we get an anon_vma. */
*area = 0;
/* Enable KSM. */
madvise(area, size, MADV_MERGEABLE);
pause();
return 0;
}
$ ./ksm-sparse &
$ echo 1 > /sys/kernel/mm/ksm/run
Without this patch ksmd uses 100% of the cpu for a long time (more then 1
hour in my test machine) scanning all the 32 TiB virtual address space
that contain only one mapped page. This makes ksmd essentially deadlocked
not able to deduplicate anything of value. With this patch ksmd walks
only the one mapped page and skips the rest of the 32 TiB virtual address
space, making the scan fast using little cpu.
Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com
Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com
Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1]
Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging")
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: craftfever <craftfever@airmail.cc>
Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If no stack depot is allocated yet, due to masking out __GFP_RECLAIM flags
kmsan called from kmalloc cannot allocate stack depot. kmsan fails to
record origin and report issues. This may result in KMSAN failing to
report issues.
Reusing flags from kmalloc without modifying them should be safe for kmsan.
For example, such chain of calls is possible:
test_uninit_kmalloc -> kmalloc -> __kmalloc_cache_noprof ->
slab_alloc_node -> slab_post_alloc_hook ->
kmsan_slab_alloc -> kmsan_internal_poison_memory.
Only when it is called in a context without flags present should
__GFP_RECLAIM flags be masked.
With this change all kmsan tests start working reliably.
Eric reported:
: Yes, KMSAN seems to be at least partially broken currently. Besides the
: fact that the kmsan KUnit test is currently failing (which I reported at
: https://lore.kernel.org/r/20250911175145.GA1376@sol), I've confirmed that
: the poly1305 KUnit test causes a KMSAN warning with Aleksei's patch
: applied but does not cause a warning without it. The warning did get
: reached via syzbot somehow
: (https://lore.kernel.org/r/751b3d80293a6f599bb07770afcef24f623c7da0.1761026343.git.xiaopei01@kylinos.cn/),
: so KMSAN must still work in some cases. But it didn't work for me.
Link: https://lkml.kernel.org/r/20250930115600.709776-2-aleksei.nikiforov@linux.ibm.com
Link: https://lkml.kernel.org/r/20251022030213.GA35717@sol
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Signed-off-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The order check and fallback loop is updating the index value on every
loop. This will cause the index to be wrongly aligned by a larger value
while the loop shrinks the order.
This may result in inserting and returning a folio of the wrong index and
cause data corruption with some userspace workloads [1].
[kasong@tencent.com: introduce a temporary variable to improve code]
Link: https://lkml.kernel.org/r/20251023065913.36925-1-ryncsn@gmail.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20251022105719.18321-1-ryncsn@gmail.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1]
Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Page cache folios from a file system that support large block size (LBS)
can have minimal folio order greater than 0, thus a high order folio might
not be able to be split down to order-0. Commit e220917fa507 ("mm: split
a folio in minimum folio order chunks") bumps the target order of
split_huge_page*() to the minimum allowed order when splitting a LBS
folio. This causes confusion for some split_huge_page*() callers like
memory failure handling code, since they expect after-split folios all
have order-0 when split succeeds but in reality get min_order_for_split()
order folios and give warnings.
Fix it by failing a split if the folio cannot be split to the target
order. Rename try_folio_split() to try_folio_split_to_order() to reflect
the added new_order parameter. Remove its unused list parameter.
[The test poisons LBS folios, which cannot be split to order-0 folios, and
also tries to poison all memory. The non split LBS folios take more
memory than the test anticipated, leading to OOM. The patch fixed the
kernel warning and the test needs some change to avoid OOM.]
Link: https://lkml.kernel.org/r/20251017013630.139907-1-ziy@nvidia.com
Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- Fix for potential infinite loop in kmalloc_nolock() when debugging
is enabled for the cache (Vlastimil Babka)
* tag 'slab-for-6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: prevent infinite loop in kmalloc_nolock() with debugging
|
|
We want to expand usage of sheaves to all non-boot caches, including
kmalloc caches. Since sheaves themselves are also allocated by
kmalloc(), we need to prevent excessive or infinite recursion -
depending on sheaf size, the sheaf can be allocated from smaller, same
or larger kmalloc size bucket, there's no particular constraint.
This is similar to allocating the objext arrays so let's just reuse the
existing mechanisms for those. __GFP_NO_OBJ_EXT in alloc_empty_sheaf()
will prevent a nested kmalloc() from allocating a sheaf itself - it will
either have sheaves already, or fallback to a non-sheaf-cached
allocation (so bootstrap of sheaves in a kmalloc cache that allocates
sheaves from its own size bucket is possible). Additionally, reuse
OBJCGS_CLEAR_MASK to clear unwanted gfp flags from the nested
allocation.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-5-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The function is tricky and many of its tests are hard to understand. Try
to improve that by using more descriptively named variables and added
comments.
- rename 'prior' to 'old_head' to match the head and tail parameters
- introduce a 'bool was_full' to make it more obvious what we are
testing instead of the !prior and prior tests
- add or improve comments in various places to explain what we're doing
Also replace kmem_cache_has_cpu_partial() tests with
IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) which are compile-time constants.
We can do that because the kmem_cache_debug(s) case is handled upfront
via free_to_partial_list().
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-1-b8218e1ac7ef@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
CONFIG_SLUB_TINY minimizes the SLUB's memory overhead in multiple ways,
mainly by avoiding percpu caching of slabs and objects. It also reduces
code size by replacing some code paths with simplified ones through
ifdefs, but the benefits of that are smaller and would complicate the
upcoming changes.
Thus remove these code paths and associated ifdefs and simplify the code
base.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-4-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When a pfmemalloc allocation actually dips into reserves, the slab is
marked accordingly and non-pfmemalloc allocations should not be allowed
to allocate from it. The sheaves percpu caching currently doesn't follow
this rule, so implement it before we expand sheaves usage to all caches.
Make sure objects from pfmemalloc slabs don't end up in percpu sheaves.
When freeing, skip sheaves when freeing an object from pfmemalloc slab.
When refilling sheaves, use __GFP_NOMEMALLOC to override any pfmemalloc
context - the allocation will fallback to regular slab allocations when
sheaves are depleted and can't be refilled because of the override.
For kfree_rcu(), detect pfmemalloc slabs after processing the rcu_sheaf
after the grace period in __rcu_free_sheaf_prepare() and simply flush
it if any object is from pfmemalloc slabs.
For prefilled sheaves, try to refill them first with __GFP_NOMEMALLOC
and if it fails, retry without __GFP_NOMEMALLOC but then mark the sheaf
pfmemalloc, which makes it flushed back to slabs when returned.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-3-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
SLUB's internal bulk allocation __kmem_cache_alloc_bulk() can currently
allocate some objects from KFENCE, i.e. when refilling a sheaf. It works
but it's conceptually the wrong layer, as KFENCE allocations should only
happen when objects are actually handed out from slab to its users.
Currently for sheaf-enabled caches, slab_alloc_node() can return KFENCE
object via kfence_alloc(), but also via alloc_from_pcs() when a sheaf
was refilled with KFENCE objects. Continuing like this would also
complicate the upcoming sheaf refill changes.
Thus remove KFENCE allocation from __kmem_cache_alloc_bulk() and move it
to the places that return slab objects to users. slab_alloc_node() is
already covered (see above). Add kfence_alloc() to
kmem_cache_alloc_from_sheaf() to handle KFENCE allocations from
prefilled sheafs, with a comment that the caller should not expect the
sheaf size to decrease after every allocation because of this
possibility.
For kmem_cache_alloc_bulk() implement a different strategy to handle
KFENCE upfront and rely on internal batched operations afterwards.
Assume there will be at most once KFENCE allocation per bulk allocation
and then assign its index in the array of objects randomly.
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-2-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
In review of a followup work, Harry noticed a potential infinite loop.
Upon closed inspection, it already exists for kmalloc_nolock() on a
cache with debugging enabled, since commit af92793e52c3 ("slab:
Introduce kmalloc_nolock() and kfree_nolock().")
When alloc_single_from_new_slab() fails to trylock node list_lock, we
keep retrying to get partial slab or allocate a new slab. If we indeed
interrupted somebody holding the list_lock, the trylock fill fail
deterministically and we end up allocating and defer-freeing slabs
indefinitely with no progress.
To fix it, fail the allocation if spinning is not allowed. This is
acceptable in the restricted context of kmalloc_nolock(), especially
with debugging enabled.
Reported-by: Harry Yoo <harry.yoo@oracle.com>
Closes: https://lore.kernel.org/all/aQLqZjjq1SPD3Fml@hyeyoo/
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251103-fix-nolock-loop-v1-1-6e2b3e82b9da@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add a new filemap_get_folios_dirty() helper to look up existing dirty
folios in a range and add them to a folio_batch. This is to support
optimization of certain iomap operations that only care about dirty
folios in a target range. For example, zero range only zeroes the subset
of dirty pages over unwritten mappings, seek hole/data may use similar
logic in the future, etc.
Note that the helper is intended for use under internal fs locks.
Therefore it trylocks folios in order to filter out clean folios.
This loosely follows the logic from filemap_range_has_writeback().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of (x + y) << z rather than
x << z + y << z.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-11-willy@infradead.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use core_param() and __core_param_cb() instead of __setup() or
__setup_param() to improve syntax checking and error messages.
Replace get_option() with kstrtouint(), because:
* the latter accepts a pointer to const char,
* these parameters should not accept ranges,
* error value can be passed directly to parser.
There is one more change apart from the parsing of numeric parameters:
slab_strict_numa parameter name must match exactly. Before this patch the
kernel would silently accept any option that starts with the name as an
undocumented alias.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/6ae7e0ddc72b7619203c07dd5103a598e12f713b.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The logic in wbc_to_tag() is widely used in file systems, so modify this
function to be inline and use it in file systems.
This patch has only passed compilation tests, but it should be fine.
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rename filemap_fdatawrite_range_kick to filemap_flush_range because it
is the ranged version of filemap_flush.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-11-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use filemap_fdatawrite_range and filemap_fdatawrite_range_kick instead
of the low-level __filemap_fdatawrite_range that requires the caller
to know the internals of the writeback_control structure and remove
__filemap_fdatawrite_range now that it is trivial and only two callers
would be left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-10-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace filemap_fdatawrite_wbc, which exposes a writeback_control to the
callers with a filemap_writeback helper that takes all the possible
arguments and declares the writeback_control itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-9-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
And rewrite filemap_fdatawrite to use filemap_fdatawrite_range instead
to have a simpler call chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-8-hch@lst.de
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Abstract out the btrfs-specific behavior of kicking off I/O on a number
of pages on an address_space into a well-defined helper.
Note: there is no kerneldoc comment for the new function because it is
not part of the public API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-7-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-2-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use __core_param_cb() to parse the "slab_debug" kernel parameter instead of
the obsolescent __setup(). For now, the parameter is not exposed in sysfs,
and no get ops is provided.
There is a slight change in behavior. Before this patch, the following
parameter would silently turn on full debugging for all slabs:
slub_debug_yada_yada_gotta_love_this=hail_satan!
This syntax is now rejected, and the parameter will be passed to user
space, making the kernel a holier place.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/9674b34861394088c7853edf8e9d2b439fd4b42f.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Since the string passed to slab_debug is never modified, use pointers to
const char in all places where it is processed.
No functional changes intended.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Reviewed-by: Christoph Lameter <cl@gentwo.org>
Link: https://patch.msgid.link/819095b921f6ae03bb54fd69ee4020e2a3aef675.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Two fixes for race conditions in obj_exts allocation (Hao Ge)
- Fix for slab accounting imbalance due to deferred slab decativation
(Vlastimil Babka)
* tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: Fix obj_ext mistakenly considered NULL due to race condition
slab: fix slab accounting imbalance due to defer_deactivate_slab()
slab: Avoid race on slab->obj_exts in alloc_slab_obj_exts
|
|
If two competing threads enter alloc_slab_obj_exts(), and the one that
allocates the vector wins the cmpxchg(), the other thread that failed
allocation mistakenly assumes that slab->obj_exts is still empty due to
its own allocation failure. This will then trigger warnings with
CONFIG_MEM_ALLOC_PROFILING_DEBUG checks in the subsequent free path.
Therefore, let's check the result of cmpxchg() to see if marking the
allocation as failed was successful. If it wasn't, check whether the
winning side has succeeded its allocation (it might have been also
marking it as failed) and if yes, return success.
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Link: https://patch.msgid.link/20251023143313.1327968-1-hao.ge@linux.dev
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and
kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
that we discard the newly allocated slab if we can't spin and we fail to
trylock. As a result we don't perform inc_slabs_node() later in the
function. Instead we perform a deferred deactivate_slab() which can
either put the unacounted slab on partial list, or discard it
immediately while performing dec_slabs_node(). Either way will cause an
accounting imbalance.
Fix this by not marking the slab as frozen, and using free_slab()
instead of deactivate_slab() for non-frozen slabs in
free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
case. By not using discard_slab() we avoid dec_slabs_node().
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Link: https://patch.msgid.link/20251023-fix-slab-accounting-v2-1-0e62d50986ea@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"17 hotfixes. 12 are cc:stable and 14 are for MM.
There's a two-patch DAMON series from SeongJae Park which addresses a
missed check and possible memory leak. Apart from that it's all
singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
csky: abiv2: adapt to new folio flags field
mm/damon/core: use damos_commit_quota_goal() for new goal commit
mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme
hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()
vmw_balloon: indicate success when effectively deflating during migration
mm/damon/core: fix list_add_tail() call on damon_call()
mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap
mm: prevent poison consumption when splitting THP
ocfs2: clear extent cache after moving/defragmenting extents
mm: don't spin in add_stack_record when gfp flags don't allow
dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC
mm/damon/sysfs: dealloc commit test ctx always
mm/damon/sysfs: catch commit test ctx alloc failure
hung_task: fix warnings caused by unaligned lock pointers
|
|
Prior to this change, no security hooks were called at the creation of a
memfd file. It means that, for SELinux as an example, it will receive
the default type of the filesystem that backs the in-memory inode. In
most cases, that would be tmpfs, but if MFD_HUGETLB is passed, it will
be hugetlbfs. Both can be considered implementation details of memfd.
It also means that it is not possible to differentiate between a file
coming from memfd_create and a file coming from a standard tmpfs mount
point.
Additionally, no permission is validated at creation, which differs from
the similar memfd_secret syscall.
Call security_inode_init_security_anon during creation. This ensures
that the file is setup similarly to other anonymous inodes. On SELinux,
it means that the file will receive the security context of its task.
The ability to limit fexecve on memfd has been of interest to avoid
potential pitfalls where /proc/self/exe or similar would be executed
[1][2]. Reuse the "execute_no_trans" and "entrypoint" access vectors,
similarly to the file class. These access vectors may not make sense for
the existing "anon_inode" class. Therefore, define and assign a new
class "memfd_file" to support such access vectors.
Guard these changes behind a new policy capability named "memfd_class".
[1] https://crbug.com/1305267
[2] https://lore.kernel.org/lkml/20221215001205.51969-1-jeffxu@google.com/
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Tested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
When damos_commit_quota_goals() is called for adding new DAMOS quota goals
of DAMOS_QUOTA_USER_INPUT metric, current_value fields of the new goals
should be also set as requested.
However, damos_commit_quota_goals() is not updating the field for the
case, since it is setting only metrics and target values using
damos_new_quota_goal(), and metric-optional union fields using
damos_commit_quota_goal_union(). As a result, users could see the first
current_value parameter that committed online with a new quota goal is
ignored. Users are assumed to commit the current_value for
DAMOS_QUOTA_USER_INPUT quota goals, since it is being used as a feedback.
Hence the real impact would be subtle. That said, this is obviously not
intended behavior.
Fix the issue by using damos_commit_quota_goal() which sets all quota goal
parameters, instead of damos_commit_quota_goal_union(), which sets only
the union fields.
Link: https://lkml.kernel.org/r/20251014001846.279282-1-sj@kernel.org
Fixes: 1aef9df0ee90 ("mm/damon/core: commit damos_quota_goal->nid")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_destroy_scheme
Currently, damon_destroy_scheme() only cleans up the filter list but
leaves ops_filter untouched, which could lead to memory leaks when a
scheme is destroyed.
This patch ensures both filter and ops_filter are properly freed in
damon_destroy_scheme(), preventing potential memory leaks.
Link: https://lkml.kernel.org/r/20251014084225.313313-1-lienze@kylinos.cn
Fixes: ab82e57981d0 ("mm/damon/core: introduce damos->ops_filters")
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When hugetlb_vmdelete_list() processes VMAs during truncate operations, it
may encounter VMAs where huge_pmd_unshare() is called without the required
shareable lock. This triggers an assertion failure in
hugetlb_vma_assert_locked().
The previous fix in commit dd83609b8898 ("hugetlbfs: skip VMAs without
shareable locks in hugetlb_vmdelete_list") skipped entire VMAs without
shareable locks to avoid the assertion. However, this prevented pages
from being unmapped and freed, causing a regression in
fallocate(PUNCH_HOLE) operations where pages were not freed immediately,
as reported by Mark Brown.
Instead of checking locks in the caller or skipping VMAs, move the lock
assertions in huge_pmd_unshare() to after the early return checks. The
assertions are only needed when actual PMD unsharing work will be
performed. If the function returns early because sz != PMD_SIZE or the
PMD is not shared, no locks are required and assertions should not fire.
This approach reverts the VMA skipping logic from commit dd83609b8898
("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
while moving the assertions to avoid the assertion failure, keeping all
the logic within huge_pmd_unshare() itself and allowing page unmapping and
freeing to proceed for all VMAs.
Link: https://lkml.kernel.org/r/20251014113344.21194-1-kartikey406@gmail.com
Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://syzkaller.appspot.com/bug?extid=f26d7c75c26ec19790e7
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Each damon_ctx maintains callback requests using a linked list
(damon_ctx->call_controls). When a new callback request is received via
damon_call(), the new request should be added to the list. However, the
function is making a mistake at list_add_tail() invocation: putting the
new item to add and the list head to add it before, in the opposite order.
Because of the linked list manipulation implementation, the new request
can still be reached from the context's list head. But the list items
that were added before the new request are dropped from the list.
As a result, the callbacks are unexpectedly not invocated. Worse yet, if
the dropped callback requests were dynamically allocated, the memory is
leaked. Actually DAMON sysfs interface is using a dynamically allocated
repeat-mode callback request for automatic essential stats update. And
because the online DAMON parameters commit is using a non-repeat-mode
callback request, the issue can easily be reproduced, like below.
# damo start --damos_action stat --refresh_stat 1s
# damo tune --damos_action stat --refresh_stat 1s
The first command dynamically allocates the repeat-mode callback request
for automatic essential stat update. Users can see the essential stats
are automatically updated for every second, using the sysfs interface.
The second command calls damon_commit() with a new callback request that
was made for the commit. As a result, the previously added repeat-mode
callback request is dropped from the list. The automatic stats refresh
stops working, and the memory for the repeat-mode callback request is
leaked. It can be confirmed using kmemleak.
Fix the mistake on the list_add_tail() call.
Link: https://lkml.kernel.org/r/20251014205939.1206-1-sj@kernel.org
Fixes: 004ded6bee11 ("mm/damon: accept parallel damon_call() requests")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit b714ccb02a76 ("mm/mremap: complete refactor of move_vma()")
mistakenly introduced a new behaviour - clearing the VM_ACCOUNT flag of
the old mapping when a mapping is mremap()'d with the MREMAP_DONTUNMAP
flag set.
While we always clear the VM_LOCKED and VM_LOCKONFAULT flags for the old
mapping (the page tables have been moved, so there is no data that could
possibly be locked in memory), there is no reason to touch any other VMA
flags.
This is because after the move the old mapping is in a state as if it were
freshly mapped. This implies that the attributes of the mapping ought to
remain the same, including whether or not the mapping is accounted.
Link: https://lkml.kernel.org/r/20251013165836.273113-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Fixes: b714ccb02a76 ("mm/mremap: complete refactor of move_vma()")
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If two competing threads enter alloc_slab_obj_exts() and one of them
fails to allocate the object extension vector, it might override the
valid slab->obj_exts allocated by the other thread with
OBJEXTS_ALLOC_FAIL. This will cause the thread that lost this race and
expects a valid pointer to dereference a NULL pointer later on.
Update slab->obj_exts atomically using cmpxchg() to avoid
slab->obj_exts overrides by racing threads.
Thanks for Vlastimil and Suren's help with debugging.
Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://patch.msgid.link/20251021010353.1187193-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Sumanth Korikkar says:
====================
Provide a new interface for dynamic configuration and
deconfiguration of hotplug memory on s390, allowing with/without
memmap_on_memory support. It is a follow up on the discussion with David
when introducing memmap_on_memory support for s390 and support dynamic
(de)configuration of memory:
https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/
The original motivation for introducing memmap_on_memory on s390 was to
avoid using online memory to store struct pages metadata, particularly
for standby memory blocks. This became critical in cases where there was
an imbalance between standby and online memory, potentially leading to
boot failures due to insufficient memory for metadata allocation.
To address this, memmap_on_memory was utilized on s390. However, in its
current form, it adds struct pages metadata at the start of each memory
block at the time of addition (only standby memory), and this
configuration is static. It cannot be changed at runtime (When the user
needs continuous physical memory).
Inorder to provide more flexibility to the user and overcome the above
limitation, add an option to dynamically configure and deconfigure
hotpluggable memory block with/without memmap_on_memory.
With the new interface, s390 will not add all possible hotplug memory in
advance, like before, to make it visible in sysfs for online/offline
actions. Instead, before memory block can be set online, it has to be
configured via a new interface in /sys/firmware/memory/memoryX/config,
which makes s390 similar to others. i.e. Adding of hotpluggable memory is
controlled by the user instead of adding it at boottime.
s390 kernel sysfs interface to configure/deconfigure memory with
memmap_on_memory (with upcoming lsmem changes):
* Initial memory layout:
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Configure memory
echo 1 > /sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0x87ffffff 128M offline 16 yes yes
0x88000000-0xffffffff 1.9G offline 17-31 no yes
* Deconfigure memory
echo 0 > /sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Enable memmap_on_memory and online it.
(Deconfigure first)
echo 0 > /sys/devices/system/memory/memory5/online
echo 0 > /sys/firmware/memory/memory5/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M offline 5 no no
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
(Enable memmap_on_memory and online it)
echo 1 > /sys/firmware/memory/memory5/memmap_on_memory
echo 1 > /sys/firmware/memory/memory5/config
echo 1 > /sys/devices/system/memory/memory5/online
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M online 5 yes yes
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Disable memmap_on_memory and online it.
(Deconfigure first)
echo 0 > /sys/devices/system/memory/memory5/online
echo 0 > /sys/firmware/memory/memory5/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M offline 5 no yes
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
(Disable memmap_on_memory and online it)
echo 0 > /sys/firmware/memory/memory5/memmap_on_memory
echo 1 > /sys/firmware/memory/memory5/config
echo 1 > /sys/devices/system/memory/memory5/online
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Userspace changes:
lsmem/chmem tool is also changed to use the new interface. I will send
it to util-linux soon.
Patch 1 adds support for removal of boot-allocated memory blocks.
Patch 2 provides option to dynamically configure and deconfigure memory
with/without memmap_on_memory.
Patch 3 removes MHP_OFFLINE_INACCESSIBLE from s390. The mhp flag was
used to mark memory as not accessible until memory hotplug online phase
begins. However, with patch 2, it is no longer essential. Memory can be
brought to accessible state before adding memory, as the memory is added
during runttime now instead of boottime.
Patch 4 removes the MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. It
is no longer needed. Memory can be brought to accessible state before
adding memory now, with runtime (de)configuration of memory.
Note: The patches apply to the linux-next branch.
v3:
Thanks David
* Avoid goto label in create_standby_sclp_mems().
* Use unsigned long instead of u64.
* Add Acked-by.
v2:
Thanks David
* Rename struct mblock/mblock_arg with struct sclp_mem/sclp_mem_arg.
* Rename all mblocks/mblock references with sclp_mems/sclp_mem -
structures, functions.
* Rename create_online_mblock() with create_configured_sclp_mem().
* Rename config_mblock_show()/config_mblock_store() with
config_sclp_mem_show()/config_sclp_mem_store().
* Remove contains_standby_increment() and
sclp_mem_notifier. sclp mem state change is performed when
adding/removing memory. sclp memory notifier - no longer needed with
this patchset.
* Recover sclp mem state when add_memory() fails.
* Refactor and add function init_sclp_mem().
* Use unsigned long instead of unsigned long long.
* Simplify and correct kobj handling. Thanks Heiko.
====================
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.
The script:
@@
expression inode, flags;
@@
- inode->i_state & flags
+ inode_state_read(inode) & flags
@@
expression inode, flags;
@@
- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)
@@
expression inode, flag1, flag2;
@@
- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)
@@
expression inode, flags;
@@
- inode->i_state |= flags
+ inode_state_set(inode, flags)
@@
expression inode, flags;
@@
- inode->i_state = flags
+ inode_state_assign(inode, flags)
@@
expression inode, flags;
@@
- flags = inode->i_state
+ flags = inode_state_read(inode)
@@
expression inode, flags;
@@
- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If obj_exts allocation failed, slab->obj_exts is set to OBJEXTS_ALLOC_FAIL,
But we do not clear it when freeing the slab. Since OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS currently share the same bit position, during the
release of the associated folio, a VM_BUG_ON_FOLIO() check in
folio_memcg_kmem() is triggered because the OBJEXTS_ALLOC_FAIL flag was
not cleared, causing it to be interpreted as a kmem folio (non-slab)
with MEMCG_OBJEXTS_DATA flag set, which is invalid because
MEMCG_OBJEXTS_DATA is supposed to be set only on slabs.
Another problem that predates sharing the OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS bits is that on configurations with
is_check_pages_enabled(), the non-cleared bit in page->memcg_data will
trigger a free_page_is_bad() failure "page still charged to cgroup"
When freeing a slab, we clear slab->obj_exts if the obj_ext array has
been successfully allocated. So let's clear it also when the allocation
has failed.
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Fixes: 7612833192d5 ("slab: Reuse first bit for OBJEXTS_ALLOC_FAIL")
Link: https://lore.kernel.org/all/20251015141642.700170-1-hao.ge@linux.dev/
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered
by an in-userspace #MC necessitates splitting the THP. The splitting
process employs a mechanism, implemented in
try_to_map_unused_to_zeropage(), which reads the pages in the THP to
identify zero-filled pages. However, reading the pages in the THP results
in a second in-kernel #MC, occurring before the initial memory_failure()
completes, ultimately leading to a kernel panic. See the kernel panic
call trace on the two #MCs.
First Machine Check occurs // [1]
memory_failure() // [2]
try_to_split_thp_page()
split_huge_page()
split_huge_page_to_list_to_order()
__folio_split() // [3]
remap_page()
remove_migration_ptes()
remove_migration_pte()
try_to_map_unused_to_zeropage() // [4]
memchr_inv() // [5]
Second Machine Check occurs // [6]
Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused THP to zeropage.
[5] Re-access pages in the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a panic kernel.
In Step[2], memory_failure() sets the poisoned flag on the page in the THP
by TestSetPageHWPoison() before calling try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing to the
poisoned page in the THP during zeropage identification, while continuing
to scan unaffected pages in the THP for possible zeropage mapping. This
prevents a second in-kernel #MC that would cause kernel panic in Step[4].
Thanks to Andrew Zaborowski for his initial work on fixing this issue.
Link: https://lkml.kernel.org/r/20251015064926.1887643-1-qiuxu.zhuo@intel.com
Link: https://lkml.kernel.org/r/20251011075520.320862-1-qiuxu.zhuo@intel.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot was able to find the following path:
add_stack_record_to_list mm/page_owner.c:182 [inline]
inc_stack_record_count mm/page_owner.c:214 [inline]
__set_page_owner+0x2c3/0x4a0 mm/page_owner.c:333
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
prep_new_page mm/page_alloc.c:1859 [inline]
get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858
alloc_pages_nolock_noprof+0x94/0x120 mm/page_alloc.c:7554
Don't spin in add_stack_record_to_list() when it is called
from *_nolock() context.
Link: https://lkml.kernel.org/r/CAADnVQK_8bNYEA7TJYgwTYR57=TTFagsvRxp62pFzS_z129eTg@mail.gmail.com
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: syzbot+8259e1d0e3ae8ed0c490@syzkaller.appspotmail.com
Reported-by: syzbot+665739f456b28f32b23d@syzkaller.appspotmail.com
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The damon_ctx for testing online DAMON parameters commit inputs is
deallocated only when the test fails. This means memory is leaked for
every successful online DAMON parameters commit. Fix the leak by always
deallocating it.
Link: https://lkml.kernel.org/r/20251003201455.41448-3-sj@kernel.org
Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/sysfs: fix commit test damon_ctx [de]allocation".
DAMON sysfs interface dynamically allocates and uses a damon_ctx object
for testing if given inputs for online DAMON parameters update is valid.
The object is being used without an allocation failure check, and leaked
when the test succeeds. Fix the two bugs.
This patch (of 2):
The damon_ctx for testing online DAMON parameters commit inputs is used
without its allocation failure check. This could result in an invalid
memory access. Fix it by directly returning an error when the allocation
failed.
Link: https://lkml.kernel.org/r/20251003201455.41448-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251003201455.41448-2-sj@kernel.org
Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
defer_free() links pending objects using the slab's freelist offset
which is fine as they are not free yet. free_deferred_objects() then
clears this pointer to avoid confusing the debugging consistency checks
that may be enabled for the cache.
However, with CONFIG_SLAB_FREELIST_HARDENED, even the NULL pointer needs
to be encoded appropriately using set_freepointer(), otherwise it's
decoded as something else and triggers the consistency checks, as found
by the kernel test robot.
Use set_freepointer() to prevent the issue.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202510101652.7921fdc6-lkp@intel.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers were introduced
to prepare the transition of memory to and from a physically accessible
state. This enhancement was crucial for implementing the "memmap on memory"
feature for s390.
With introduction of dynamic (de)configuration of hotpluggable memory,
memory can be brought to accessible state before add_memory(). Memory
can be brought to inaccessible state before remove_memory(). Hence,
there is no need of MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory
notifiers anymore.
This basically reverts commit
c5f1e2d18909 ("mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers")
Additionally, apply minor adjustments to the function parameters of
move_pfn_range_to_zone() and mhp_supports_memmap_on_memory() to ensure
compatibility with the latest branch.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
"A NULL pointer deref hotfix"
* tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: fix barn NULL pointer dereference on memoryless nodes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 hotfixes. All 7 are cc:stable and all 7 are for MM.
All singletons, please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: hugetlb: avoid soft lockup when mprotect to large memory area
fsnotify: pass correct offset to fsnotify_mmap_perm()
mm/ksm: fix flag-dropping behavior in ksm_madvise
mm/damon/vaddr: do not repeat pte_offset_map_lock() until success
mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage
mm/thp: fix MTE tag mismatch when replacing zero-filled subpages
memcg: skip cgroup_file_notify if spinning is not allowed
|
|
Phil reported a boot failure once sheaves become used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.
The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory and so it does with
struct node_barn.
For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.
Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fixes for several corner cases in error paths and debugging options,
related to the new kmalloc_nolock() functionality (Kuniyuki Iwashima,
Ran Xiaokai)
* tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slub: Don't call lockdep_unregister_key() for immature kmem_cache.
slab: Fix using this_cpu_ptr() in preemptible context
slab: Add allow_spin check to eliminate kmemleak warnings
|
|
When calling mprotect() to a large hugetlb memory area in our customer's
workload (~300GB hugetlb memory), soft lockup was observed:
watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [t2_new_sysv:126916]
CPU: 98 PID: 126916 Comm: t2_new_sysv Kdump: loaded Not tainted 6.17-rc7
Hardware name: GIGACOMPUTING R2A3-T40-AAV1/Jefferson CIO, BIOS 5.4.4.1 07/15/2025
pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mte_clear_page_tags+0x14/0x24
lr : mte_sync_tags+0x1c0/0x240
sp : ffff80003150bb80
x29: ffff80003150bb80 x28: ffff00739e9705a8 x27: 0000ffd2d6a00000
x26: 0000ff8e4bc00000 x25: 00e80046cde00f45 x24: 0000000000022458
x23: 0000000000000000 x22: 0000000000000004 x21: 000000011b380000
x20: ffff000000000000 x19: 000000011b379f40 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc875e0aa5e2c
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : fffffc01ce7a5c00 x4 : 00000000046cde00 x3 : fffffc0000000000
x2 : 0000000000000004 x1 : 0000000000000040 x0 : ffff0046cde7c000
Call trace:
mte_clear_page_tags+0x14/0x24
set_huge_pte_at+0x25c/0x280
hugetlb_change_protection+0x220/0x430
change_protection+0x5c/0x8c
mprotect_fixup+0x10c/0x294
do_mprotect_pkey.constprop.0+0x2e0/0x3d4
__arm64_sys_mprotect+0x24/0x44
invoke_syscall+0x50/0x160
el0_svc_common+0x48/0x144
do_el0_svc+0x30/0xe0
el0_svc+0x30/0xf0
el0t_64_sync_handler+0xc4/0x148
el0t_64_sync+0x1a4/0x1a8
Soft lockup is not triggered with THP or base page because there is
cond_resched() called for each PMD size.
Although the soft lockup was triggered by MTE, it should be not MTE
specific. The other processing which takes long time in the loop may
trigger soft lockup too.
So add cond_resched() for hugetlb to avoid soft lockup.
Link: https://lkml.kernel.org/r/20250929202402.1663290-1-yang@os.amperecomputing.com
Fixes: 8f860591ffb2 ("[PATCH] Enable mprotect on huge pages")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fsnotify_mmap_perm() requires a byte offset for the file about to be
mmap'ed. But it is called from vm_mmap_pgoff(), which has a page offset.
Previously the conversion was done incorrectly so let's fix it, being
careful not to overflow on 32-bit platforms.
Discovered during code review.
Link: https://lkml.kernel.org/r/20251003155238.2147410-1-ryan.roberts@arm.com
Fixes: 066e053fe208 ("fsnotify: add pre-content hooks on mmap()")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON's virtual address space operation set implementation (vaddr) calls
pte_offset_map_lock() inside the page table walk callback function. This
is for reading and writing page table accessed bits. If
pte_offset_map_lock() fails, it retries by returning the page table walk
callback function with ACTION_AGAIN.
pte_offset_map_lock() can continuously fail if the target is a pmd
migration entry, though. Hence it could cause an infinite page table walk
if the migration cannot be done until the page table walk is finished.
This indeed caused a soft lockup when CPU hotplugging and DAMON were
running in parallel.
Avoid the infinite loop by simply not retrying the page table walk. DAMON
is promising only a best-effort accuracy, so missing access to such pages
is no problem.
Link: https://lkml.kernel.org/r/20250930004410.55228-1-sj@kernel.org
Fixes: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Xinyu Zheng <zhengxinyu6@huawei.com>
Closes: https://lore.kernel.org/20250918030029.2652607-1-zhengxinyu6@huawei.com
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
subpage to shared zeropage
When splitting an mTHP and replacing a zero-filled subpage with the shared
zeropage, try_to_map_unused_to_zeropage() currently drops several
important PTE bits.
For userspace tools like CRIU, which rely on the soft-dirty mechanism for
incremental snapshots, losing the soft-dirty bit means modified pages are
missed, leading to inconsistent memory state after restore.
As pointed out by David, the more critical uffd-wp bit is also dropped.
This breaks the userfaultfd write-protection mechanism, causing writes to
be silently missed by monitoring applications, which can lead to data
corruption.
Preserve both the soft-dirty and uffd-wp bits from the old PTE when
creating the new zeropage mapping to ensure they are correctly tracked.
Link: https://lkml.kernel.org/r/20250930081040.80926-1-lance.yang@linux.dev
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When both THP and MTE are enabled, splitting a THP and replacing its
zero-filled subpages with the shared zeropage can cause MTE tag mismatch
faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the
zeropage has a fixed tag of zero, which may not match the tag expected by
the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64
intentionally reports MTE-tagged pages as non-identical to prevent unsafe
merging.
As suggested by David[1], this patch adopts the same pattern, replacing the
memchr_inv() byte-level check with a call to pages_identical(). This
leverages existing architecture-specific logic to determine if a page is
truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the
design more future-proof, IMO. Instead of handling quirks in generic code,
we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Link: https://lkml.kernel.org/r/20250922021458.68123-1-lance.yang@linux.dev
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com>
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@mediatek.com
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: andrew.yang <andrew.yang@mediatek.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Generally memcg charging is allowed from all the contexts including NMI
where even spinning on spinlock can cause locking issues. However one
call chain was missed during the addition of memcg charging from any
context support. That is try_charge_memcg() -> memcg_memory_event() ->
cgroup_file_notify().
The possible function call tree under cgroup_file_notify() can acquire
many different spin locks in spinning mode. Some of them are
cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
just skip cgroup_file_notify() from memcg charging if the context does not
allow spinning.
Alternative approach was also explored where instead of skipping
cgroup_file_notify(), we defer the memcg event processing to irq_work [1].
However it adds complexity and it was decided to keep things simple until
we need more memcg events with !allow_spinning requirement.
Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
Link: https://lkml.kernel.org/r/20250922220203.261714-1-shakeel.butt@linux.dev
Fixes: 3ac4638a734a ("memcg: make memcg_rstat_updated nmi safe")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Closes: https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peilin Ye <yepeilin@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
to make it clear that KHO operates on pages rather than on a random
physical address.
The kho_preserve_pages() will be also used in upcoming support for vmalloc
preservation.
Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two small fixes for the recently performed code refactoring (Shigeru
Yoshida) and missing handling of direction parameter in DMA debug code
(Petr Tesarik)"
* tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: fix direction in dma_alloc direction traces
kmsan: fix kmsan_handle_dma() to avoid false positives
|
|
syzbot reported the lockdep splat below in __kmem_cache_release(). [0]
The problem is that __kmem_cache_release() could be called from
do_kmem_cache_create() before init_kmem_cache_cpus() registers
the lockdep key.
Let's perform lockdep_unregister_key() only when init_kmem_cache_cpus()
has been done, which we can determine by checking s->cpu_slab
[0]:
WARNING: CPU: 1 PID: 6128 at kernel/locking/lockdep.c:6606 lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606
Modules linked in:
CPU: 1 UID: 0 PID: 6128 Comm: syz.4.21 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606
Code: 50 e4 0f 48 3b 44 24 10 0f 84 26 fe ff ff e8 bd cd 17 09 e8 e8 ce 17 09 41 f7 c7 00 02 00 00 74 bd fb 40 84 ed 75 bc eb cd 90 <0f> 0b 90 e9 19 ff ff ff 90 0f 0b 90 e9 2a ff ff ff 48 c7 c7 d0 ac
RSP: 0018:ffffc90003e870d0 EFLAGS: 00010002
RAX: eb1525397f5bdf00 RBX: ffff88803c121148 RCX: 1ffff920007d0dfc
RDX: 0000000000000000 RSI: ffffffff8acb1500 RDI: ffffffff8b1dd0e0
RBP: 00000000ffffffea R08: ffffffff8eb5aa37 R09: 1ffffffff1d6b546
R10: dffffc0000000000 R11: fffffbfff1d6b547 R12: 0000000000000000
R13: ffff88814d1b8900 R14: 0000000000000000 R15: 0000000000000203
FS: 00007f773f75e6c0(0000) GS:ffff88812712f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffdaea3af52 CR3: 000000003a5ca000 CR4: 00000000003526f0
Call Trace:
<TASK>
__kmem_cache_release+0xe3/0x1e0 mm/slub.c:7696
do_kmem_cache_create+0x74e/0x790 mm/slub.c:8575
create_cache mm/slab_common.c:242 [inline]
__kmem_cache_create_args+0x1ce/0x330 mm/slab_common.c:340
nfsd_file_cache_init+0x1d6/0x530 fs/nfsd/filecache.c:816
nfsd_startup_generic fs/nfsd/nfssvc.c:282 [inline]
nfsd_startup_net fs/nfsd/nfssvc.c:377 [inline]
nfsd_svc+0x393/0x900 fs/nfsd/nfssvc.c:786
nfsd_nl_threads_set_doit+0x84a/0x960 fs/nfsd/nfsctl.c:1639
genl_family_rcv_msg_doit+0x212/0x300 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x219/0x270 net/socket.c:742
____sys_sendmsg+0x508/0x820 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f77400eeec9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f773f75e038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f7740345fa0 RCX: 00007f77400eeec9
RDX: 0000000000008004 RSI: 0000200000000180 RDI: 0000000000000006
RBP: 00007f7740171f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f7740346038 R14: 00007f7740345fa0 R15: 00007ffce616f8d8
</TASK>
[alexei.starovoitov@gmail.com: simplify the fix]
Link: https://lore.kernel.org/all/20251007052534.2776661-1-kuniyu@google.com/
Fixes: 83382af9ddc3 ("slab: Make slub local_(try)lock more precise for LOCKDEP")
Reported-by: syzbot+a6f4d69b9b23404bbabf@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e4a3d1.a00a0220.298cc0.0471.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
defer_free() maybe called in preemptible context, this will trigger the
below warning message:
BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
caller is defer_free+0x1b/0x60
Call Trace:
<TASK>
dump_stack_lvl+0xac/0xc0
check_preemption_disabled+0xbe/0xe0
defer_free+0x1b/0x60
kfree_nolock+0x1eb/0x2b0
alloc_slab_obj_exts+0x356/0x390
__alloc_tagging_slab_alloc_hook+0xa0/0x300
__kmalloc_cache_noprof+0x1c4/0x5c0
__set_page_owner+0x10d/0x1c0
post_alloc_hook+0x84/0xf0
get_page_from_freelist+0x73b/0x1380
__alloc_frozen_pages_noprof+0x110/0x2c0
alloc_pages_mpol+0x44/0x140
alloc_slab_page+0xac/0x150
allocate_slab+0x78/0x3a0
___slab_alloc+0x76b/0xed0
__slab_alloc.constprop.0+0x5a/0xb0
__kmalloc_noprof+0x3dc/0x6d0
__list_lru_init+0x6c/0x210
alloc_super+0x3b6/0x470
sget_fc+0x5f/0x3a0
get_tree_nodev+0x27/0x90
vfs_get_tree+0x26/0xc0
vfs_kern_mount.part.0+0xb6/0x140
kern_mount+0x24/0x40
init_pipe_fs+0x4f/0x70
do_one_initcall+0x62/0x2e0
kernel_init_freeable+0x25b/0x4b0
kernel_init+0x1a/0x1c0
ret_from_fork+0x290/0x2e0
ret_from_fork_asm+0x11/0x20
</TASK>
Disable preemption in defer_free() and also defer_deactivate_slab() to
make it safe.
[vbabka@suse.cz: disable preemption instead of using raw_cpu_ptr() per
the discussion ]
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Link: https://lore.kernel.org/r/20250930083402.782927-1-ranxiaokai627@163.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
"Only two patch series in this pull request:
- "mm/memory_hotplug: fixup crash during uevent handling" from Hannes
Reinecke fixes a race that was causing udev to trigger a crash in
the memory hotplug code
- "mm_slot: following fixup for usage of mm_slot_entry()" from Wei
Yang adds some touchups to the just-merged mm_slot changes"
* tag 'mm-stable-2025-10-03-16-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/khugepaged: use KMEM_CACHE()
mm/ksm: cleanup mm_slot_entry() invocation
Documentation/mm: drop pxx_mkdevmap() descriptions from page table helpers
mm: clean up is_guard_pte_marker()
drivers/base: move memory_block_add_nid() into the caller
mm/memory_hotplug: activate node before adding new memory blocks
drivers/base/memory: add node id parameter to add_memory_block()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull mm-init update from Mike Rapoport:
"Simplify deferred initialization of struct pages
Refactor and simplify deferred initialization of the memory map.
Beside the negative diffstat it gives 3ms (55ms vs 58ms) reduction in
the initialization of deferred pages on single node system with 64GiB
of RAM"
* tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: drop for_each_free_mem_pfn_range_in_zone_from()
mm/mm_init: drop deferred_init_maxorder()
mm/mm_init: deferred_init_memmap: use a job per zone
mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
- Refactoring of DMA mapping API to physical addresses as the primary
interface instead of page+offset parameters
This gets much closer to Matthew Wilcox's long term wish for
struct-pageless IO to cacheable DRAM and is supporting memdesc
project which seeks to substantially transform how struct page works.
An advantage of this approach is the possibility of introducing
DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the
common paths, what in turn lets to use recently introduced
dma_iova_link() API to map PCI P2P MMIO without creating struct page
Developped by Leon Romanovsky and Jason Gunthorpe
- Minor clean-up by Petr Tesarik and Qianfeng Rong
* tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
kmsan: fix missed kmsan_handle_dma() signature conversion
mm/hmm: properly take MMIO path
mm/hmm: migrate to physical address-based DMA mapping API
dma-mapping: export new dma_*map_phys() interface
xen: swiotlb: Open code map_resource callback
dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs()
kmsan: convert kmsan_handle_dma to use physical addresses
dma-mapping: convert dma_direct_*map_page to be phys_addr_t based
iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys()
iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys
dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys
dma-debug: refactor to use physical addresses for page mapping
iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link().
dma-mapping: introduce new DMA attribute to indicate MMIO memory
swiotlb: Remove redundant __GFP_NOWARN
dma-direct: clean up the logic in __dma_direct_alloc_pages()
|
|
We got some late review commits during review of commit b4c9ffb54b32
("mm/khugepaged: remove definition of struct khugepaged_mm_slot"). No
need to keep the old cache name "khugepaged_mm_slot", let's simply use
KMEM_CACHE().
Link: https://lkml.kernel.org/r/20251001091900.20041-3-richard.weiyang@gmail.com
Fixes: b4c9ffb54b32 ("mm/khugepaged: remove definition of struct khugepaged_mm_slot")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: SeongJae Park <sj@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Kiryl Shutsemau <kas@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm_slot: following fixup for usage of mm_slot_entry()", v2.
We got some late review commits during review of "mm_slot: fix the usage of
mm_slot_entry()" in [1].
This patch (of 2):
We got some late review commits during review of commit 08498be43ee6
("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL"). Let's
reduce the indentation level and make the code easier to follow by using
gotos to a new label.
Link: https://lkml.kernel.org/r/20251001091900.20041-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20251001091900.20041-2-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250927004539.19308-1-richard.weiyang@gmail.com [1]
Fixes: 08498be43ee6 ("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's simplify the implementation. The current code is redundant as it
effectively expands to:
is_swap_pte(pte) &&
is_pte_marker_entry(...) && // from is_pte_marker()
is_pte_marker_entry(...) // from is_guard_swp_entry()
While a modern compiler could likely optimize this away, let's have clean
code and not rely on it.
Link: https://lkml.kernel.org/r/20250924045830.3817-1-lance.yang@linux.dev
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The sysfs attributes for memory blocks require the node ID to be set and
initialized, so move the node activation before adding new memory blocks.
This also has the nice side effect that the BUG_ON() can be converted into
a WARN_ON() as we now can handle registration errors.
Link: https://lkml.kernel.org/r/20250729064637.51662-3-hare@kernel.org
Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add a Kconfig option to redirect dfprintk() to the trace buffer
- Enable use of the RWF_DONTCACHE flag on the NFS client
- Add striped layout handling to pNFS flexfiles
- Add proper localio handling for READ and WRITE O_DIRECT
Bugfixes:
- Handle NFS4ERR_GRACE errors during delegation recall
- Fix NFSv4.1 backchannel max_resp_sz verification check
- Fix mount hang after CREATE_SESSION failure
- Fix d_parent->d_inode locking in nfs4_setup_readdir()
Other Cleanups and Improvements:
- Improvements to write handling tracepoints
- Fix a few trivial spelling mistakes
- Cleanups to the rpcbind cleanup call sites
- Convert the SUNRPC xdr_buf to use a scratch folio instead of
scratch page
- Remove unused NFS_WBACK_BUSY() macro
- Remove __GFP_NOWARN flags
- Unexport rpc_malloc() and rpc_free()"
* tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits)
NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
nfs/localio: add proper O_DIRECT support for READ and WRITE
nfs/localio: refactor iocb initialization
nfs/localio: refactor iocb and iov_iter_bvec initialization
nfs/localio: avoid issuing misaligned IO using O_DIRECT
nfs/localio: make trace_nfs_local_open_fh more useful
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
sunrpc: unexport rpc_malloc() and rpc_free()
NFSv4/flexfiles: Add support for striped layouts
NFSv4/flexfiles: Update layout stats & error paths for striped layouts
NFSv4/flexfiles: Write path updates for striped layouts
NFSv4/flexfiles: Commit path updates for striped layouts
NFSv4/flexfiles: Read path updates for striped layouts
NFSv4/flexfiles: Update low level helper functions to be DS stripe aware.
NFSv4/flexfiles: Add data structure support for striped layouts
NFSv4/flexfiles: Use ds_commit_idx when marking a write commit
NFSv4/flexfiles: Remove cred local variable dependency
nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing
NFS: Enable use of the RWF_DONTCACHE flag on the NFS client
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Extend copy_file_range interface to be fully 64bit capable (Miklos)
- Add selftest for fusectl (Chen Linxuan)
- Move fuse docs into a separate directory (Bagas Sanjaya)
- Allow fuse to enter freezable state in some cases (Sergey
Senozhatsky)
- Clean up writeback accounting after removing tmp page copies (Joanne)
- Optimize virtiofs request handling (Li RongQing)
- Add synchronous FUSE_INIT support (Miklos)
- Allow server to request prune of unused inodes (Miklos)
- Fix deadlock with AIO/sync release (Darrick)
- Add some prep patches for block/iomap support (Darrick)
- Misc fixes and cleanups
* tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits)
fuse: move CREATE_TRACE_POINTS to a separate file
fuse: move the backing file idr and code into a new source file
fuse: enable FUSE_SYNCFS for all fuseblk servers
fuse: capture the unique id of fuse commands being sent
fuse: fix livelock in synchronous file put from fuseblk workers
mm: fix lockdep issues in writeback handling
fuse: add prune notification
fuse: remove redundant calls to fuse_copy_finish() in fuse_notify()
fuse: fix possibly missing fuse_copy_finish() call in fuse_notify()
fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h>
fuse: remove fuse_readpages_end() null mapping check
fuse: fix references to fuse.rst -> fuse/fuse.rst
fuse: allow synchronous FUSE_INIT
fuse: zero initialize inode private data
fuse: remove unused 'inode' parameter in fuse_passthrough_open
virtio_fs: fix the hash table using in virtio_fs_enqueue_req()
mm: remove BDI_CAP_WRITEBACK_ACCT
fuse: use default writeback accounting
virtio_fs: Remove redundant spinlock in virtio_fs_request_complete()
fuse: remove unneeded offset assignment when filling write pages
...
|
|
In slab_post_alloc_hook(), kmemleak check is skipped when
gfpflags_allow_spinning() returns false since commit af92793e52c3
("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Therefore, unconditionally calling kmemleak_not_leak() in
alloc_slab_obj_exts() would trigger the following warning:
kmemleak: Trying to color unknown object at 0xffff8881057f5000 as Grey
Call Trace:
alloc_slab_obj_exts+0x1b5/0x370
__alloc_tagging_slab_alloc_hook+0x9f/0x2d0
__kmalloc_cache_noprof+0x1c4/0x5c0
__set_page_owner+0x10d/0x1c0
post_alloc_hook+0x84/0xf0
get_page_from_freelist+0x73b/0x1380
__alloc_frozen_pages_noprof+0x110/0x2c0
alloc_pages_mpol+0x44/0x140
alloc_slab_page+0xac/0x150
allocate_slab+0x78/0x3a0
___slab_alloc+0x76b/0xed0
__slab_alloc.constprop.0+0x5a/0xb0
Add the allow_spin check in alloc_slab_obj_exts() to eliminate the above
warning.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250930063831.782815-1-ranxiaokai627@163.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
KMSAN reports an uninitialized value issue in dma_map_phys()[1]. This
is a false positive caused by the way the virtual address is handled
in kmsan_handle_dma(). Fix it by translating the physical address to
a virtual address using phys_to_virt().
[1]
BUG: KMSAN: uninit-value in dma_map_phys+0xdc5/0x1060
dma_map_phys+0xdc5/0x1060
dma_map_page_attrs+0xcf/0x130
e1000_xmit_frame+0x3c51/0x78f0
dev_hard_start_xmit+0x22f/0xa30
sch_direct_xmit+0x3b2/0xcf0
__dev_queue_xmit+0x3588/0x5e60
neigh_resolve_output+0x9c5/0xaf0
ip6_finish_output2+0x24e0/0x2d30
ip6_finish_output+0x903/0x10d0
ip6_output+0x331/0x600
mld_sendpack+0xb4a/0x1770
mld_ifc_work+0x1328/0x19b0
process_scheduled_works+0xb91/0x1d80
worker_thread+0xedf/0x1590
kthread+0xd5c/0xf00
ret_from_fork+0x1f5/0x4c0
ret_from_fork_asm+0x1a/0x30
Uninit was created at:
__kmalloc_cache_noprof+0x8f5/0x16b0
syslog_print+0x9a/0xef0
do_syslog+0x849/0xfe0
__x64_sys_syslog+0x97/0x100
x64_sys_call+0x3cf8/0x3e30
do_syscall_64+0xd9/0xfa0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Bytes 0-89 of 90 are uninitialized
Memory access of size 90 starts at ffff8880367ed000
CPU: 1 UID: 0 PID: 1552 Comm: kworker/1:2 Not tainted 6.17.0-next-20250929 #26 PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
Workqueue: mld mld_ifc_work
Fixes: 6eb1e769b2c1 ("kmsan: convert kmsan_handle_dma to use physical addresses")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251002051024.3096061-1-syoshida@redhat.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- A new layer for caching objects for allocation and free via percpu
arrays called sheaves.
The aim is to combine the good parts of SLAB (lower-overhead and
simpler percpu caching, compared to SLUB) without the past issues
with arrays for freeing remote NUMA node objects and their flushing.
It also allows more efficient kfree_rcu(), and cheaper object
preallocations for cases where the exact number of objects is
unknown, but an upper bound is.
Currently VMAs and maple nodes are using this new caching, with a
plan to enable it for all caches and remove the complex SLUB fastpath
based on cpu (partial) slabs and this_cpu_cmpxchg_double().
(Vlastimil Babka, with Liam Howlett and Pedro Falcato for the maple
tree changes)
- Re-entrant kmalloc_nolock(), which allows opportunistic allocations
from NMI and tracing/kprobe contexts.
Building on prior page allocator and memcg changes, it will result in
removing BPF-specific caches on top of slab (Alexei Starovoitov)
- Various fixes and cleanups. (Kuan-Wei Chiu, Matthew Wilcox, Suren
Baghdasaryan, Ye Liu)
* tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (40 commits)
slab: Introduce kmalloc_nolock() and kfree_nolock().
slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
slab: Make slub local_(try)lock more precise for LOCKDEP
mm: Introduce alloc_frozen_pages_nolock()
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
locking/local_lock: Introduce local_lock_is_locked().
maple_tree: Convert forking to use the sheaf interface
maple_tree: Add single node allocation support to maple state
maple_tree: Prefilled sheaf conversion and testing
tools/testing: Add support for prefilled slab sheafs
maple_tree: Replace mt_free_one() with kfree()
maple_tree: Use kfree_rcu in ma_free_rcu
testing/radix-tree/maple: Hack around kfree_rcu not existing
tools/testing: include maple-shim.c in maple.c
maple_tree: use percpu sheaves for maple_node_cache
mm, vma: use percpu sheaves for vm_area_struct cache
tools/testing: Add support for changes to slab for sheaves
slab: allow NUMA restricted allocations to use percpu sheaves
tools/testing/vma: Implement vm_refcnt reset
slab: skip percpu sheaves for remote object freeing
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthop, making
dump operating scaling linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump order of
magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- A significant crypto library API for SCTP, MPTCP and IPv6 SR,
reducing code duplication
- Supports pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's good stuff across the board, including some nice mm
improvements for CPUs with the 'noabort' BBML2 feature and a clever
patch to allow ptdump to play nicely with block mappings in the
vmalloc area.
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace
- Extend Spectre workarounds to cover additional Arm CPU variants
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of
mapping granularity (BBML2_NOABORT)
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled
- Tidy up the types used in our early MMU setup code
- Rework rodata= for closer parity with the behaviour on x86
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths
- Re-enable warnings when building the 32-bit vDSO object
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU
- Support for Hisilicon's MN and NoC PMUs
- Support for Fujitsu's Uncore PMU
- Support for SPE's extended event filtering feature
- Preparatory work to enable data source filtering in SPE
- Support for multiple lanes in the DWC PCIe PMU
- Support for i.MX94 in the IMX DDR PMU driver
- MAINTAINERS update (Thank you, Yicong)
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test
- Support nolibc in GCS tests
- Extend SVE ptrace test to pass unsupported regsets and invalid
vector lengths
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1
- Sync TCR_EL1 definition with the latest Arm ARM (L.b)
- Be stricter about the input fed into our AWK sysreg generator
script
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: cpufeature: Remove duplicate asm/mmu.h header
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
perf/dwc_pcie: Fix use of uninitialized variable
arm/syscalls: mark syscall invocation as likely in invoke_syscall
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64/sysreg: Update TCR_EL1 register
arm64: Enable vmalloc-huge with ptdump
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
arm64: errata: Apply workarounds for Neoverse-V3AE
arm64: cputype: Add Neoverse-V3AE definitions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner:
"This contains work adressing lockups reported by users when a systemd
unit reading lots of files from a filesystem mounted with the lazytime
mount option exits.
With the lazytime mount option enabled we can be switching many dirty
inodes on cgroup exit to the parent cgroup. The numbers observed in
practice when systemd slice of a large cron job exits can easily reach
hundreds of thousands or millions.
The logic in inode_do_switch_wbs() which sorts the inode into
appropriate place in b_dirty list of the target wb however has linear
complexity in the number of dirty inodes thus overall time complexity
of switching all the inodes is quadratic leading to workers being
pegged for hours consuming 100% of the CPU and switching inodes to the
parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
- Avoiding contention on the wb->list_lock when switching inodes by
running a single work item per wb and managing a queue of items
switching to the wb
- Allowing rescheduling when switching inodes over to a different
cgroup to avoid softlockups
- Maintaining the b_dirty list ordering instead of sorting it"
* tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
writeback: Add tracepoint to track pending inode switches
writeback: Avoid excessively long inode switching times
writeback: Avoid softlockup when switching many inodes
writeback: Avoid contention on wb->list_lock when switching inodes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
From the cover letter [1]:
This patch set introduces kmalloc_nolock() which is the next logical
step towards any context allocation necessary to remove bpf_mem_alloc
and get rid of preallocation requirement in BPF infrastructure.
In production BPF maps grew to gigabytes in size. Preallocation wastes
memory. Alloc from any context addresses this issue for BPF and other
subsystems that are forced to preallocate too.
This long task started with introduction of alloc_pages_nolock(), then
memcg and objcg were converted to operate from any context including
NMI, this set completes the task with kmalloc_nolock() that builds on
top of alloc_pages_nolock() and memcg changes.
After that BPF subsystem will gradually adopt it everywhere.
Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1]
|
|
This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). The plan is therefore that it will be later enabled for
all kmem caches and replace the complicated cpu (partial) slabs code.
Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays magazines like the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".
This caching may seem similar to the arrays we had in SLAB, but there
are some important differences:
- deals differently with NUMA locality of freed objects, thus there are
no per-node "shared" arrays (with possible lock contention) and no
"alien" arrays that would need periodical flushing
- instead, freeing remote objects (which is rare) bypasses the sheaves
- percpu sheaves thus contain only local objects (modulo rare races
and local node exhaustion)
- NUMA restricted allocations and strict_numa mode is still honoured
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
for guaranteed and efficient allocations in a restricted context, when
the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)
The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of local double cmpxchg,
thanks to local_trylock() it becomes a preempt_disable() and no atomic
operations. Same for freeing, which is otherwise a local double cmpxchg
only for short term allocations (so the same slab is still active on the
same cpu when freeing the object) and a more costly locked double
cmpxchg otherwise.
- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
separate percpu sheaf and only submit the whole sheaf to call_rcu()
when full. After the grace period, the sheaf can be used for
allocations, which is more efficient than freeing and reallocating
individual slab objects (even with the batching done by kfree_rcu()
implementation itself). In case only some cpus are allowed to handle rcu
callbacks, the sheaf can still be made available to other cpus on the
same node via the shared barn. The maple_node cache uses kfree_rcu() and
thus can benefit from this.
Note: this path is currently limited to !PREEMPT_RT
- Preallocation support. A prefilled sheaf can be privately borrowed to
perform a short term operation that is not allowed to block in the
middle and may need to allocate some objects. If an upper bound (worst
case) for the number of allocations is known, but only much fewer
allocations actually needed on average, borrowing and returning a sheaf
is much more efficient then a bulk allocation for the worst case
followed by a bulk free of the many unused objects. Maple tree write
operations should benefit from this.
- Compatibility with slub_debug. When slub_debug is enabled for a cache,
we simply don't create the percpu sheaves so that the debugging hooks
(at the node partial list slowpaths) are reached as before. The same
thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
reusing the (ineffective) paths for requests exceeding the cache's
sheaf_capacity. This is in line with the existing approach where
debugging bypasses the fast paths and SLUB_TINY preferes memory
savings over performance.
The above is adapted from the cover letter [1], which contains also
in-kernel microbenchmark results showing the lower overhead of sheaves.
Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark
also show improvements.
Results from Sudarsan Mahendran [3] using will-it-scale show both
benefits and regressions, probably due to overall noisiness of those
tests.
Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1]
Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2]
Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
|
|
kmalloc_nolock() relies on ability of local_trylock_t to detect
the situation when per-cpu kmem_cache is locked.
In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
disables IRQs and marks s->cpu_slab->lock as acquired.
local_lock_is_locked(&s->cpu_slab->lock) returns true when
slab is in the middle of manipulating per-cpu cache
of that specific kmem_cache.
kmalloc_nolock() can be called from any context and can re-enter
into ___slab_alloc():
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
or
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
can be acquired without a deadlock before invoking the function.
If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
retries in a different kmalloc bucket. The second attempt will
likely succeed, since this cpu locked different kmem_cache.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
per-cpu rt_spin_lock is locked by current _task_. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely is
not locked by the current task. Though it may be locked by a
different task it's safe to rt_spin_lock() and sleep on it.
Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and (in_nmi() or in PREEMPT_RT).
SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate,
while kfree_nolock() always defers to irq_work.
Note, kfree_nolock() must be called _only_ for objects allocated
with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
will miss kmemleak/kfence book keeping and will cause false positives.
large_kmalloc is not supported by either kmalloc_nolock()
or kfree_nolock().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Since the combination of valid upper bits in slab->obj_exts with
OBJEXTS_ALLOC_FAIL bit can never happen,
use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
instead of (1ull << 2) to free up bit 2.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
kmalloc_nolock() can be called from any context
the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock
in PREEMPT_RT) and attempt to acquire a different local_trylock_t
while in the same task context.
The calling sequence might look like:
kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()
or more precisely:
__lock_acquire+0x12ad/0x2590
lock_acquire+0x133/0x2d0
rt_spin_lock+0x6f/0x250
___slab_alloc+0xb7/0xec0
kmalloc_nolock_noprof+0x15a/0x430
my_debug_callback+0x20e/0x390 [testmod]
___slab_alloc+0x256/0xec0
__kmalloc_cache_noprof+0xd6/0x3b0
Make LOCKDEP understand that local_trylock_t-s protect
different kmem_caches. In order to do that add lock_class_key
for each kmem_cache and use that key in local_trylock_t.
This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
but teach lockdep about it only for PREEMPT_RT, since
in !PREEMPT_RT the ___slab_alloc() code is using
local_trylock_irqsave() when lockdep is on.
Note, this patch applies this logic to local_lock_t
while the next one converts it to local_trylock_t.
Both are mapped to rt_spin_lock in PREEMPT_RT.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
to be used by alloc_slab_page().
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Change alloc_pages_nolock() to default to __GFP_COMP when allocating
pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP.
Also allow __GFP_ACCOUNT flag to be specified,
since most of BPF infra needs __GFP_ACCOUNT except BPF streams.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Create the vm_area_struct cache with percpu sheaves of size 32 to
improve its performance.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Currently allocations asking for a specific node explicitly or via
mempolicy in strict_numa node bypass percpu sheaves. Since sheaves
contain mostly local objects, we can try allocating from them if the
local node happens to be the requested node or allowed by the mempolicy.
If we find the object from percpu sheaves is not from the expected node,
we skip the sheaves - this should be rare.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Since we don't control the NUMA locality of objects in percpu sheaves,
allocations with node restrictions bypass them. Allocations without
restrictions may however still expect to get local objects with high
probability, and the introduction of sheaves can decrease it due to
freed object from a remote node ending up in percpu sheaves.
The fraction of such remote frees seems low (5% on an 8-node machine)
but it can be expected that some cache or workload specific corner cases
exist. We can either conclude that this is not a problem due to the low
fraction, or we can make remote frees bypass percpu sheaves and go
directly to their slabs. This will make the remote frees more expensive,
but if it's only a small fraction, most frees will still benefit from
the lower overhead of percpu sheaves.
This patch thus makes remote object freeing bypass percpu sheaves,
including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
it's not intended to be 100% guarantee that percpu sheaves will only
contain local objects. The refill from slabs does not provide that
guarantee in the first place, and there might be cpu migrations
happening when we need to unlock the local_lock. Avoiding all that could
be possible but complicated so we can leave it for later investigation
whether it would be worth it. It can be expected that the more selective
freeing will itself prevent accumulation of remote objects in percpu
sheaves so any such violations would have only short-term effects.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
The possibility of many barn operations is determined by the current
number of full or empty sheaves. Taking the barn->lock just to find out
that e.g. there are no empty sheaves results in unnecessary overhead and
lock contention. Thus perform these checks outside of the lock with a
data_race() annotated variable read and fail quickly without taking the
lock.
Checks for sheaf availability that racily succeed have to be obviously
repeated under the lock for correctness, but we can skip repeating
checks if there are too many sheaves on the given list as the limits
don't need to be strict.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.
kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.
kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.
kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to cache's sheaf capacity for better efficiency of sheaves handling,
but it's not stricly necessary to succeed.
kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.
kmem_cache_sheaf_size() can be used to query the current size.
The implementation supports requesting sizes that exceed cache's
sheaf_capacity, but it is not efficient - such "oversize" sheaves are
allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
might be especially ineffective when replacing a sheaf with a new one of
a larger capacity. It is therefore better to size cache's
sheaf_capacity accordingly to make oversize sheaves exceptional.
CONFIG_SLUB_STATS counters are added for sheaf prefill and return
operations. A prefill or return is considered _fast when it is able to
grab or return a percpu spare sheaf (even if the sheaf needs a refill to
satisfy the request, as those should amortize over time), and _slow
otherwise (when the barn or even sheaf allocation/freeing has to be
involved). sheaf_prefill_oversize is provided to determine how many
prefills were oversize (counter for oversize returns is not necessary as
all oversize refills result in oversize returns).
When slub_debug is enabled for a cache with sheaves, no percpu sheaves
exist for it, but the prefill functionality is still provided simply by
all prefilled sheaves becoming oversize. If percpu sheaves are not
created for a cache due to not passing the sheaf_capacity argument on
cache creation, the prefills also work through oversize sheaves, but
there's a WARN_ON_ONCE() to indicate the omission.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.
Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these
redundant flags across subsystems.
No functional changes.
Link: https://lkml.kernel.org/r/20250804125657.482109-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.
kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.
It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.
Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
existing batching
- sheaves can be reused for allocations via barn instead of being
flushed to slabs, which is more efficient
- this includes cases where only some cpus are allowed to process rcu
callbacks (CONFIG_RCU_NOCB_CPU)
Possible disadvantage:
- objects might be waiting for more than their grace period (it is
determined by the last object freed into the sheaf), increasing memory
usage - but the existing batching does that too.
Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.
Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.
Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.
Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
It is possible to hit a zero entry while traversing the vmas in unuse_mm()
called from swapoff path and accessing it causes the OOPS:
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000446--> Loading the memory from offset 0x40 on the
XA_ZERO_ENTRY as address.
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
The issue is manifested from the below race between the fork() on a
process and swapoff:
fork(dup_mmap()) swapoff(unuse_mm)
--------------- -----------------
1) Identical mtree is built using
__mt_dup().
2) copy_pte_range()-->
copy_nonpresent_pte():
The dst mm is added into the
mmlist to be visible to the
swapoff operation.
3) Fatal signal is sent to the parent
process(which is the current during the
fork) thus skip the duplication of the
vmas and mark the vma range with
XA_ZERO_ENTRY as a marker for this process
that helps during exit_mmap().
4) swapoff is tried on the
'mm' added to the 'mmlist' as
part of the 2.
5) unuse_mm(), that iterates
through the vma's of this 'mm'
will hit the non-NULL zero entry
and operating on this zero entry
as a vma is resulting into the
oops.
The proper fix would be around not exposing this partially-valid tree to
others when droping the mmap lock, which is being solved with [1]. A
simpler solution would be checking for MMF_UNSTABLE, as it is set if
mm_struct is not fully initialized in dup_mmap().
Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
issue.
Link: https://lkml.kernel.org/r/20250924181138.1762750-1-charan.kalla@oss.qualcomm.com
Link: https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/ [1]
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When collapsing a pmd, there are two address in use:
* address points to the start of pmd
* address points to each individual page
Current naming makes it difficult to distinguish these two and is hence
error prone.
Considering the plan to collapse mTHP, name the first one `start_addr' and
the second `addr' for better readability and consistency.
Link: https://lkml.kernel.org/r/20250922140938.27343-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Nico Pache <npache@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is a boot failure when both CONFIG_DEBUG_KMEMLEAK and
CONFIG_MEM_ALLOC_PROFILING are enabled.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:__alloc_tagging_slab_alloc_hook+0x181/0x2f0
Call Trace:
kmem_cache_alloc_noprof+0x1c8/0x5c0
__alloc_object+0x2f/0x290
__create_object+0x22/0x80
kmemleak_init+0x122/0x190
mm_core_init+0xb6/0x160
start_kernel+0x39f/0x920
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0x104/0x120
common_startup_64+0x12c/0x138
In kmemleak, mem_pool_alloc() directly calls kmem_cache_alloc_noprof(), as
a result, current->alloc_tag is NULL, leading to a null pointer
dereference.
Move the checks for SLAB_NO_OBJ_EXT, SLAB_NOLEAKTRACE, and
__GFP_NO_OBJ_EXT to the parent function __alloc_tagging_slab_alloc_hook()
to fix this.
Also this distinguishes the SLAB_NOLEAKTRACE case between the actual
memory allocation failures case, make CODETAG_FLAG_INACCURATE more
accurate.
Link: https://lkml.kernel.org/r/20250926080659.741991-1-ranxiaokai627@163.com
Fixes: b9e2f58ffb84 ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We added that "select MEMORY_ISOLATION" in commit ee6f509c3274 ("mm:
factor out memory isolate functions"). However, in commit add05cecef80
("mm: soft-offline: don't free target page in successful page migration")
we remove the need for it, where we removed the calls to
set_migratetype_isolate() etc.
What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages()
support. But that comes with CONFIG_MIGRATION. And
isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION.
Therefore, we can remove "select MEMORY_ISOLATION" of MEMORY_FAILURE.
Link: https://lkml.kernel.org/r/20250922143618.48640-1-xieyuanbin1@huawei.com
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Current code is not correct to get struct khugepaged_mm_slot by
mm_slot_entry() without checking mm_slot is !NULL. There is no problem
reported since slot is the first element of struct khugepaged_mm_slot.
While struct khugepaged_mm_slot is just a wrapper of struct mm_slot, there
is no need to define it.
Remove the definition of struct khugepaged_mm_slot, so there is not chance
to miss use mm_slot_entry().
[richard.weiyang@gmail.com: fix use-after-free crash]
Link: https://lkml.kernel.org/r/20250922002834.vz6ntj36e75ehkyp@master
Link: https://lkml.kernel.org/r/20250919071244.17020-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm_slot: fix the usage of mm_slot_entry", v2.
When using mm_slot in ksm, there is code like:
slot = mm_slot_lookup(mm_slots_hash, mm);
mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot);
if (mm_slot && ..) {
}
The mm_slot_entry() won't return a valid value if slot is NULL generally.
But currently it works since slot is the first element of struct
ksm_mm_slot.
To reduce the ambiguity and make it robust, access mm_slot_entry() when
slot is !NULL.
Link: https://lkml.kernel.org/r/20250919071244.17020-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250919071244.17020-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 79359d6d24df ("hugetlb: perform vmemmap optimization on a list of
pages") batches the submission of HugeTLB vmemmap optimization (HVO)
during hugepage reservation. With HVO enabled, hugepages obtained from
the buddy allocator are not submitted for optimization and their
struct-page memory is therefore not released—until the entire
reservation request has been satisfied. As a result, any struct-page
memory freed in the course of the allocation cannot be reused for the
ongoing reservation, artificially limiting the number of huge pages that
can ultimately be provided.
As commit b1222550fbf7 ("mm/hugetlb: do pre-HVO for bootmem allocated
pages") already applies early HVO to bootmem-allocated huge pages, this
patch extends the same benefit to non-bootmem pages by incrementally
submitting them for HVO as they are allocated, thereby returning
struct-page memory to the buddy allocator in real time. The change raises
the maximum 2 MiB hugepage reservation from just under 376 GB to more than
381 GB on a 384 GB x86 VM.
Link: https://lkml.kernel.org/r/20250919092353.41671-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When using vmalloc with VM_ALLOW_HUGE_VMAP flag, it will set the alignment
to PMD_SIZE internally, if it deems huge mappings to be eligible.
Therefore, setting the alignment in execmem_vmalloc is redundant. Apart
from this, it also reduces the probability of allocation in case vmalloc
fails to allocate hugepages - in the fallback case, vmalloc tries to use
the original alignment and allocate basepages, which unfortunately will
again be PMD_SIZE passed over from execmem_vmalloc, thus constraining the
search for a free space in vmalloc region.
Therefore, remove this constraint.
Link: https://lkml.kernel.org/r/20250918093453.75676-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Link: https://lkml.kernel.org/r/20250918174528.90879-1-manish1588@gmail.com
Signed-off-by: Manish Kumar <manish1588@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel currently does not mlock large folios when adding them to rmap,
stating that it is difficult to confirm that the folio is fully mapped and
safe to mlock it.
This leads to a significant undercount of Mlocked in /proc/meminfo,
causing problems in production where the stat was used to estimate system
utilization and determine if load shedding is required.
However, nowadays the caller passes a number of pages of the folio that
are getting mapped, making it easy to check if the entire folio is mapped
to the VMA.
mlock the folio on rmap if it is fully mapped to the VMA.
Mlocked in /proc/meminfo can still undercount, but the value is closer the
truth and is useful for userspace.
Link: https://lkml.kernel.org/r/20250923110711.690639-7-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, kernel only maps part of large folio that fits into
start_pgoff/end_pgoff range.
Map entire folio where possible. It will match finish_fault() behaviour
that user hits on cold page cache.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-6-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
finish_fault() uses per-page fault for file folios. This only occurs for
file folios smaller than PMD_SIZE.
The comment suggests that this approach prevents RSS inflation. However,
it only prevents RSS accounting. The folio is still mapped to the
process, and the fact that it is mapped by a single PTE does not affect
memory pressure. Additionally, the kernel's ability to map large folios
as PMD if they are large enough does not support this argument.
When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, try_to_unmap_once() only tries to mlock small folios.
Use logic similar to folio_referenced_one() to mlock large folios: only do
this for fully mapped folios and under page table lock that protects all
page table entries.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-4-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The mlock_vma_folio() function requires the page table lock to be held in
order to safely mlock the folio. However, folio_referenced_one() mlocks a
large folios outside of the page_vma_mapped_walk() loop where the page
table lock has already been dropped.
Rework the mlock logic to use the same code path inside the loop for both
large and small folios.
Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a page
table boundary.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-3-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Improve mlock tracking for large folios", v3.
The patchset includes several fixes and improvements related to mlock
tracking of large folios.
The main objective is to reduce the undercount of Mlocked memory in
/proc/meminfo and improve the accuracy of the statistics.
Patches 1-2:
These patches address a minor race condition in folio_referenced_one()
related to mlock_vma_folio().
Currently, mlock_vma_folio() is called on large folio without the page
table lock, which can result in a race condition with unmap (i.e.
MADV_DONTNEED). This can lead to partially mapped folios on the
unevictable LRU list.
While not a significant issue, I do not believe backporting is necessary.
Patch 3:
This patch adds mlocking logic similar to folio_referenced_one() to
try_to_unmap_one(), allowing for mlocking of large folios where possible.
Patch 4-5:
These patches modifies finish_fault() and faultaround to map in the entire
folio when possible, enabling efficient mlocking upon addition to the
rmap.
Patch 6:
This patch makes rmap mlock large folios if they are fully mapped,
addressing the primary source of mlock undercount for large folios.
This patch (of 6):
Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
the page is mapped across page table boundary. Unlike other PVMW_* flags,
this one is result of page_vma_mapped_walk() and not set by the caller.
folio_referenced_one() will use it to detect if it safe to mlock the
folio.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 56ae0bb349b4 ("mm: compaction: convert to use a folio in
isolate_migratepages_block()") converts api from page to folio. But the
low_pfn advance for hugetlb page seems wrong when low_pfn doesn't point to
head page.
Originally, if page is a hugetlb tail page, compound_nr() return 1, which
means low_pfn only advance one in next iteration. After the change,
low_pfn would advance more than the hugetlb range, since folio_nr_pages()
always return total number of the large page. This results in skipping
some range to isolate and then to migrate.
The worst case for alloc_contig is it does all the isolation and
migration, but finally find some range is still not isolated. And then
undo all the work and try a new range.
Advance low_pfn to the end of hugetlb.
Link: https://lkml.kernel.org/r/20250910092240.3981-1-richard.weiyang@gmail.com
Fixes: 56ae0bb349b4 ("mm: compaction: convert to use a folio in isolate_migratepages_block()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues
or aren't considered necessary for -stable kernels. 6 of these fixes
are for MM.
All singletons, please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines
mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call()
mailmap: add entry for Bence Csókás
fs/proc/task_mmu: check p->vec_buf for NULL
kmsan: fix out-of-bounds access to shadow memory
mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count
mm/hugetlb: fix folio is still mapped when deleted
|
|
Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will setup a caching layer of percpu arrays called
sheaves of given capacity for the created cache.
Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.
When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless over a full sheaves limit. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.
The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.
The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fallback to bulk
alloc/free to slabs directly to avoid double copying.
The sheaf_capacity value is exported in sysfs for observability.
Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs. For barn operations, barn_get and barn_put count how
many full sheaves were get from or put to the barn, the _fail variants
count how many such requests could not be satisfied mainly because the
barn was either empty or full. While the barn also holds empty sheaves
to make some operations easier, these are not as critical to mandate own
counters. Finally, there are sheaf_alloc/sheaf_free counters.
Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fallback. Each per-NUMA-node
barn has a spin_lock.
When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed. Similarly, we ignore it
with CONFIG_SLUB_TINY which prefers low memory usage to performance.
[boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]
Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
We don't need to call free_kmem_cache_nodes() immediately when failing
to allocate a kmem_cache_node, because when we return 0,
do_kmem_cache_create() calls __kmem_cache_release() which also performs
free_kmem_cache_nodes().
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
damon_sysfs_damon_call()
The callback return value is ignored in damon_sysfs_damon_call(), which
means that it is not possible to detect invalid user input when writing
commands such as 'commit' to
/sys/kernel/mm/damon/admin/kdamonds/<K>/state. Fix it.
Link: https://lkml.kernel.org/r/20250920132546.5822-1-akinobu.mita@gmail.com
Fixes: f64539dcdb87 ("mm/damon/sysfs: use damon_call() for update_schemes_stats")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():
BUG: unable to handle page fault for address: ffffbc3840291000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
[...]
Call Trace:
<TASK>
__msan_memset+0xee/0x1a0
sha224_final+0x9e/0x350
test_hash_buffer_overruns+0x46f/0x5f0
? kmsan_get_shadow_origin_ptr+0x46/0xa0
? __pfx_test_hash_buffer_overruns+0x10/0x10
kunit_try_run_case+0x198/0xa00
This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e. the next page is unmapped.
The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned. Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer. However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address. This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.
To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec44c60 ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared
count") introduced ->pt_share_count dedicated to hugetlb PMD share count
tracking, but omitted fixing copy_hugetlb_page_range(), leaving the
function relying on page_count() for tracking that no longer works.
When lazy page table copy for hugetlb is disabled, that is, revert commit
bcd51a3c679d ("hugetlb: lazy page table copies in fork()") fork()'ing with
hugetlb PMD sharing quickly lockup -
[ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[ 239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
[ 239.446631] Call Trace:
[ 239.446633] <TASK>
[ 239.446636] _raw_spin_lock+0x3f/0x60
[ 239.446639] copy_hugetlb_page_range+0x258/0xb50
[ 239.446645] copy_page_range+0x22b/0x2c0
[ 239.446651] dup_mmap+0x3e2/0x770
[ 239.446654] dup_mm.constprop.0+0x5e/0x230
[ 239.446657] copy_process+0xd17/0x1760
[ 239.446660] kernel_clone+0xc0/0x3e0
[ 239.446661] __do_sys_clone+0x65/0xa0
[ 239.446664] do_syscall_64+0x82/0x930
[ 239.446668] ? count_memcg_events+0xd2/0x190
[ 239.446671] ? syscall_trace_enter+0x14e/0x1f0
[ 239.446676] ? syscall_exit_work+0x118/0x150
[ 239.446677] ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
[ 239.446681] ? clear_bhb_loop+0x30/0x80
[ 239.446684] ? clear_bhb_loop+0x30/0x80
[ 239.446686] entry_SYSCALL_64_after_hwframe+0x76/0x7e
There are two options to resolve the potential latent issue:
1. warn against PMD sharing in copy_hugetlb_page_range(),
2. fix it.
This patch opts for the second option.
While at it, simplify the comment, the details are not actually relevant
anymore.
Link: https://lkml.kernel.org/r/20250916004520.1604530-1-jane.chu@oracle.com
Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_ctx->addr_unit is respected only for physical address space
monitoring use case. Meanwhile, damon_ctx->min_sz_region is used by the
core layer for aligning regions, regardless of whether it is set for
physical address space monitoring or virtual address spaces monitoring.
And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'. Hence, if
user sets ->addr_unit on virtual address spaces monitoring mode, regions
can be unexpectedly aligned in <PAGE_SIZE granularity. It shouldn't cause
crash-like issues but make monitoring and DAMOS behavior difficult to
understand.
Fix the unexpected behavior by setting ->min_sz_region only when it is
configured for physical address space monitoring.
The issue was found from a result of Chris' experiments that thankfully
shared with me off-list.
Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org
Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently vm_area_alloc_pages() contains two cond_resched() points.
However, the page allocator already has its own in slow path so an extra
resched is not optimal because it delays the loops.
The place where CPU time can be consumed is in the VA-space search in
alloc_vmap_area(), especially if the space is really fragmented using
synthetic stress tests, after a fast path falls back to a slow one.
Move a single cond_resched() there, after dropping free_vmap_area_lock in
a slow path. This keeps fairness where it matters while removing
redundant yields from the page-allocation path.
[akpm@linux-foundation.org: tweak comment grammar]
Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This removes the last call to page_stable_node(), so delete the wrapper.
It also removes a call to trylock_page() and saves a call to
compound_head(), as well as removing a reference to folio->page.
Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Longlong Xia <xialonglong@kylinos.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.
Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.
This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.
Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.
To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.
[gourry@gourry.net: don't retry if no kswapds were active]
Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix an incorrect logic conversion in process_mrelease().
Link: https://lkml.kernel.org/r/3b7f0faf-4dbc-4d67-8a71-752fbcdf0906@lucifer.local
Fixes: 12e423ba4eae ("mm: convert core mm to mm_flags_*() accessors")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lkml.kernel.org/r/c2e28e27-d84b-4671-8784-de5fe0d14f41@lucifer.local
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MADV_COLLAPSE on a file mapping behaves inconsistently depending on if PMD
page table is installed or not.
Consider following example:
p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
err = madvise(p, 2UL << 20, MADV_COLLAPSE);
fd is a populated tmpfs file.
The result depends on the address that the kernel returns on mmap(). If
it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.
This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a
page table is missing, which causes collapse_pte_mapped_thp() to fail.
SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.
Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Filesystems such as NFS may need to defer dropbehind until after their
2-stage writes are done. This adds a helper
folio_end_writeback_no_dropbehind() that allows them to release the
writeback flag without immediately dropping the folio.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Add a helper to allow filesystems to attempt to free the 'dropbehind'
folio.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/all/5588a06f6d5a2cf6746828e2d36e7ada668b1739.1745381692.git.trond.myklebust@hammerspace.com/
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
In commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().
This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().
However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file. This is not a safe assumption in
all cases.
The only really sane situation in which this matters would be something
like e.g. i915_gem_dmabuf_mmap() which invokes vfs_mmap() against
obj->base.filp:
ret = vfs_mmap(obj->base.filp, vma);
if (ret)
return ret;
And then sets the VMA's file to this, should the mmap operation succeed:
vma_set_file(vma, obj->base.filp);
That is - it is the file that is intended to back the VMA mapping.
This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.
However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.
Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.
Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).
If the caller needs to differentiate between the two they therefore now
can.
While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.
This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.
This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.
Also update the VMA selftests accordingly.
Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|