aboutsummaryrefslogtreecommitdiffstats
path: root/fs/buffer.c
AgeCommit message (Collapse)AuthorFilesLines
10 daysMerge tag 'vfs-6.19-rc1.folio' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()
2025-10-31buffer: Use folio_next_pos()Matthew Wilcox (Oracle)1-1/+1
This is one instruction more efficient than open-coding folio_pos() + folio_size(). It's the equivalent of (x + y) << z rather than x << z + y << z. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251024170822.1427218-4-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Coccinelle-based conversion to use ->i_state accessorsMateusz Guzik1-2/+2
All places were patched by coccinelle with the default expecting that ->i_lock is held, afterwards entries got fixed up by hand to use unlocked variants as needed. The script: @@ expression inode, flags; @@ - inode->i_state & flags + inode_state_read(inode) & flags @@ expression inode, flags; @@ - inode->i_state &= ~flags + inode_state_clear(inode, flags) @@ expression inode, flag1, flag2; @@ - inode->i_state &= ~flag1 & ~flag2 + inode_state_clear(inode, flag1 | flag2) @@ expression inode, flags; @@ - inode->i_state |= flags + inode_state_set(inode, flags) @@ expression inode, flags; @@ - inode->i_state = flags + inode_state_assign(inode, flags) @@ expression inode, flags; @@ - flags = inode->i_state + flags = inode_state_read(inode) @@ expression inode, flags; @@ - READ_ONCE(inode->i_state) & flags + inode_state_read(inode) & flags Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19fs/buffer: fix use-after-free when call bh_read() helperYe Bin1-1/+1
There's issue as follows: BUG: KASAN: stack-out-of-bounds in end_buffer_read_sync+0xe3/0x110 Read of size 8 at addr ffffc9000168f7f8 by task swapper/3/0 CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Not tainted 6.16.0-862.14.0.6.x86_64 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <IRQ> dump_stack_lvl+0x55/0x70 print_address_description.constprop.0+0x2c/0x390 print_report+0xb4/0x270 kasan_report+0xb8/0xf0 end_buffer_read_sync+0xe3/0x110 end_bio_bh_io_sync+0x56/0x80 blk_update_request+0x30a/0x720 scsi_end_request+0x51/0x2b0 scsi_io_completion+0xe3/0x480 ? scsi_device_unbusy+0x11e/0x160 blk_complete_reqs+0x7b/0x90 handle_softirqs+0xef/0x370 irq_exit_rcu+0xa5/0xd0 sysvec_apic_timer_interrupt+0x6e/0x90 </IRQ> Above issue happens when do ntfs3 filesystem mount, issue may happens as follows: mount IRQ ntfs_fill_super read_cache_page do_read_cache_folio filemap_read_folio mpage_read_folio do_mpage_readpage ntfs_get_block_vbo bh_read submit_bh wait_on_buffer(bh); blk_complete_reqs scsi_io_completion scsi_end_request blk_update_request end_bio_bh_io_sync end_buffer_read_sync __end_buffer_read_notouch unlock_buffer wait_on_buffer(bh);--> return will return to caller put_bh --> trigger stack-out-of-bounds In the mpage_read_folio() function, the stack variable 'map_bh' is passed to ntfs_get_block_vbo(). Once unlock_buffer() unlocks and wait_on_buffer() returns to continue processing, the stack variable is likely to be reclaimed. Consequently, during the end_buffer_read_sync() process, calling put_bh() may result in stack overrun. If the bh is not allocated on the stack, it belongs to a folio. Freeing a buffer head which belongs to a folio is done by drop_buffers() which will fail to free buffers which are still locked. So it is safe to call put_bh() before __end_buffer_read_notouch(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/20250811141830.343774-1-yebin@huaweicloud.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen1-13/+13
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs/buffer: remove the min and max limit checks in __getblk_slow()Pankaj Raghav1-8/+3
All filesystems will already check the max and min value of their block size during their initialization. __getblk_slow() is a very low-level function to have these checks. Remove them and only check for logical block size alignment. As this check with logical block size alignment might never trigger, add WARN_ON_ONCE() to the check. As WARN_ON_ONCE() will already print the stack, remove the call to dump_stack(). Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250626113223.181399-1-p.raghav@samsung.com Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fs: Remove three arguments from block_write_end()Matthew Wilcox (Oracle)1-4/+3
block_write_end() looks like it can be used as a ->write_end() implementation. However, it can't as it does not unlock nor put the folio. Since it does not use the 'file', 'mapping' nor 'fsdata' arguments, remove them. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23docs/vfs: update references to i_mutex to i_rwsemJunxuan Liao1-1/+1
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel lookups actual switch to rwsem), but the VFS documentation and comments still has references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19fs/buffer: remove comment about hard sectorsizePankaj Raghav1-3/+2
Commit e1defc4ff0cf ("block: Do away with the notion of hardsect_size") changed hardsect_size to logical block size. The comment on top still says hardsect_size. Remove the comment as the code is pretty clear. While we are at it, format the relevant code. Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250618075821.111459-1-p.raghav@samsung.com Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-26Merge tag 'vfs-6.16-rc1.writepage' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull final writepage conversion from Christian Brauner: "This converts vboxfs from ->writepage() to ->writepages(). This was the last user of the ->writepage() method. So remove ->writepage() completely and all references to it" * tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Remove aops->writepage mm: Remove swap_writepage() and shmem_writepage() ttm: Call shmem_writeout() from ttm_backup_backup_page() i915: Use writeback_iter() shmem: Add shmem_writeout() writeback: Remove writeback_use_writepage() migrate: Remove call to ->writepage vboxsf: Convert to writepages 9p: Add a migrate_folio method
2025-05-21fs/buffer: optimize discard_buffer()Davidlohr Bueso1-2/+2
While invalidating, the clearing of the bits in discard_buffer() is done in one fully ordered CAS operation. In the past this was done via individual clear_bit(), until e7470ee89f0 (fs: buffer: do not use unnecessary atomic operations when discarding buffers). This implies that there were never strong ordering requirements outside of being serialized by the buffer lock. As such relax the ordering for archs that can benefit. Further, the implied ordering in buffer_unlock() makes current cmpxchg implied barrier redundant due to release semantics. And while in theory the unlock could be part of the bulk clearing, it is best to leave it explicit, but without the double barriers. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/20250515173925.147823-5-dave@stgolabs.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21fs/buffer: remove superfluous statementsDavidlohr Bueso1-3/+0
Get rid of those unnecessary return statements. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/20250515173925.147823-4-dave@stgolabs.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21fs/buffer: avoid redundant lookup in getblk slowpathDavidlohr Bueso1-3/+3
__getblk_slow() already implies failing a first lookup as the fastpath, so try to create the buffers immediately and avoid the redundant lookup. This saves 5-10% of the total cost/latency of the slowpath. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/20250515173925.147823-3-dave@stgolabs.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21fs/buffer: use sleeping lookup in __getblk_slowpath()Davidlohr Bueso1-1/+6
Just as with the fast path, call the lookup variant depending on the gfp flags. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/20250515173925.147823-2-dave@stgolabs.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09fs: Remove redundant errseq_set call in mark_buffer_write_io_error.Jeremy Bongio1-3/+1
mark_buffer_write_io_error sets sb->s_wb_err to -EIO twice. Once in mapping_set_error and once in errseq_set. Only mapping_set_error checks if bh->b_assoc_map->host is NULL. Discovered during null pointer dereference during writeback to a failing device: [<ffffffff9a416dc8>] ? mark_buffer_write_io_error+0x98/0xc0 [<ffffffff9a416dbe>] ? mark_buffer_write_io_error+0x8e/0xc0 [<ffffffff9ad4bda0>] end_buffer_async_write+0x90/0xd0 [<ffffffff9ad4e3eb>] end_bio_bh_io_sync+0x2b/0x40 [<ffffffff9adbafe6>] blk_update_request+0x1b6/0x480 [<ffffffff9adbb3d8>] blk_mq_end_request+0x18/0x30 [<ffffffff9adbc6aa>] blk_mq_dispatch_rq_list+0x4da/0x8e0 [<ffffffff9adc0a68>] __blk_mq_sched_dispatch_requests+0x218/0x6a0 [<ffffffff9adc07fa>] blk_mq_sched_dispatch_requests+0x3a/0x80 [<ffffffff9adbbb98>] blk_mq_run_hw_queue+0x108/0x330 [<ffffffff9adbcf58>] blk_mq_flush_plug_list+0x178/0x5f0 [<ffffffff9adb6741>] __blk_flush_plug+0x41/0x120 [<ffffffff9adb6852>] blk_finish_plug+0x22/0x40 [<ffffffff9ad47cb0>] wb_writeback+0x150/0x280 [<ffffffff9ac5343f>] ? set_worker_desc+0x9f/0xc0 [<ffffffff9ad4676e>] wb_workfn+0x24e/0x4a0 Fixes: 485e9605c0573 ("fs/buffer.c: record blockdev write errors in super_block that it backs") Signed-off-by: Jeremy Bongio <jbongio@google.com> Link: https://lore.kernel.org/20250507123010.1228243-1-jbongio@google.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-22mm/migrate: fix sleep in atomic for large folios and buffer headsDavidlohr Bueso1-1/+11
The large folio + buffer head noref migration scenarios are being naughty and blocking while holding a spinlock. As a consequence of the pagecache lookup path taking the folio lock this serializes against migration paths, so they can wait for each other. For the private_lock atomic case, a new BH_Migrate flag is introduced which enables the lookup to bail. This allows the critical region of the private_lock on the migration path to be reduced to the way it was before ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration"), that is covering the count checks. The scope is always noref migration. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com Fixes: 3c20917120ce61 ("block/bdev: enable large folio support for large logical block sizes") Reviewed-by: Jan Kara <jack@suse.cz> Co-developed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-8-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev # [0] [1] Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-22fs/buffer: use sleeping version of __find_get_block()Davidlohr Bueso1-2/+9
Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. Convert write_boundary_block() which already takes the buffer lock as well as bdev_getblk() depending on the respective gpf flags. There are no changes in semantics. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-4-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev # [0] [1] Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-22fs/buffer: introduce sleeping flavors for pagecache lookupsDavidlohr Bueso1-0/+9
Add __find_get_block_nonatomic() and sb_find_get_block_nonatomic() calls for which users will be converted where safe. These versions will take the folio lock instead of the mapping's private_lock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-3-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-22fs/buffer: split locking for pagecache lookupsDavidlohr Bueso1-16/+25
Callers of __find_get_block() may or may not allow for blocking semantics, and is currently assumed that it will not. Layout two paths based on this. The the private_lock scheme will continued to be used for atomic contexts. Otherwise take the folio lock instead, which protects the buffers, such as vs migration and try_to_free_buffers(). Per the "hack idea", the latter can alleviate contention on the private_lock for bdev mappings. For reasons of determinism and avoid making bugs hard to reproduce, the trylocking is not attempted. No change in semantics. All lookup users still take the spinlock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-2-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07fs: Remove aops->writepageMatthew Wilcox (Oracle)1-2/+2
All callers and implementations are now removed, so remove the operation and update the documentation to match. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250402150005.2309458-10-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds1-10/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-16fs: convert block_commit_write() to take a folioMatthew Wilcox (Oracle)1-10/+4
All callers now have a folio, so pass it in instead of converting folio->page->folio. Link: https://lkml.kernel.org/r/20250217192009.437916-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-24fs/buffer fs/mpage: remove large folio restrictionLuis Chamberlain1-2/+0
Now that buffer-heads has been converted over to support large folios we can remove the built-in VM_BUG_ON_FOLIO() checks which prevents their use. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-6-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/buffer: remove batching from async readMatthew Wilcox1-30/+21
block_read_full_folio() currently puts all !uptodate buffers into an array allocated on the stack, then iterates over it twice, first locking the buffers and then submitting them for read. We want to remove this array because it occupies too much stack space on configurations with a larger PAGE_SIZE (eg 512 bytes with 8 byte pointers and a 64KiB PAGE_SIZE). We cannot simply submit buffer heads as we find them as the completion handler needs to be able to tell when all reads are finished, so it can end the folio read. So we keep one buffer in reserve (using the 'prev' variable) until the end of the function. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-3-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-24fs/buffer: simplify block_read_full_folio() with bh_offset()Luis Chamberlain1-3/+2
When we read over all buffers in a folio we currently use the buffer index on the folio and blocksize to get the offset. Simplify this with bh_offset(). This simplifies the loop while making no functional changes. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-2-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds1-5/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-21Merge tag 'reiserfs_delete' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs removal from Jan Kara: "The deprecation period of reiserfs is ending at the end of this year so it is time to remove it" * tag 'reiserfs_delete' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: The last commit
2024-11-18Merge tag 'vfs-6.13.pagecache' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagecache updates from Christian Brauner: "Cleanup filesystem page flag usage: This continues the work to make the mappedtodisk/owner_2 flag available to filesystems which don't use buffer heads. Further patches remove uses of Private2. This brings us very close to being rid of it entirely" * tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: migrate: Remove references to Private2 ceph: Remove call to PagePrivate2() btrfs: Switch from using the private_2 flag to owner_2 mm: Remove PageMappedToDisk nilfs2: Convert nilfs_copy_buffer() to use folios fs: Move clearing of mappedtodisk to buffer.c
2024-11-06memcg-v1: no need for memcg locking for dirty trackingShakeel Butt1-5/+0
During the era of memcg charge migration, the kernel has to be make sure that the dirty stat updates do not race with the charge migration. Otherwise it might update the dirty stats of the wrong memcg. Now with the memcg charge migration gone, there is no more race for dirty stat updates and the previous locking can be removed. Link: https://lkml.kernel.org/r/20241025012304.2473312-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28fs/writeback: convert wbc_account_cgroup_owner to take a folioPankaj Raghav1-2/+2
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21reiserfs: The last commitJan Kara1-2/+1
Deprecation period of reiserfs ends with the end of this year so it is time to remove it from the kernel. Acked-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2024-10-04fs: Move clearing of mappedtodisk to buffer.cMatthew Wilcox (Oracle)1-0/+1
The mappedtodisk flag is only meaningful for buffer head based filesystems. It should not be cleared for other filesystems. This allows us to reuse the mappedtodisk flag to have other meanings in filesystems that do not use buffer heads. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-2-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-16Merge tag 'vfs-6.12.folio' of ↵Linus Torvalds1-32/+31
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs folio updates from Christian Brauner: "This contains work to port write_begin and write_end to rely on folios for various filesystems. This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs, ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and squashfs. After this series lands a bunch of the filesystems in this list do not mention struct page anymore" * tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits) Squashfs: Ensure all readahead pages have been used Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index Squashfs: Update squashfs_readpage_block() to not use page->index Squashfs: Update squashfs_readahead() to not use page->index Squashfs: Update page_actor to not use page->index jffs2: Use a folio in jffs2_garbage_collect_dnode() jffs2: Convert jffs2_do_readpage_nolock to take a folio buffer: Convert __block_write_begin() to take a folio ocfs2: Convert ocfs2_write_zero_page to use a folio fs: Convert aops->write_begin to take a folio fs: Convert aops->write_end to take a folio vboxsf: Use a folio in vboxsf_write_end() orangefs: Convert orangefs_write_begin() to use a folio orangefs: Convert orangefs_write_end() to use a folio jffs2: Convert jffs2_write_begin() to use a folio jffs2: Convert jffs2_write_end() to use a folio hostfs: Convert hostfs_write_end() to use a folio fuse: Convert fuse_write_begin() to use a folio fuse: Convert fuse_write_end() to use a folio f2fs: Convert f2fs_write_begin() to use a folio ...
2024-08-30fs: drop GFP_NOFAIL mode from alloc_page_buffersMichal Hocko1-4/+1
There is only one called of alloc_page_buffers and it doesn't require __GFP_NOFAIL so drop this allocation mode. Signed-off-by: Michal Hocko <mhocko@suse.com> Link: https://lore.kernel.org/r/20240829130640.1397970-1-mhocko@kernel.org Acked-by: Song Liu <song@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: use LIST_HEAD() to simplify codeHongbo Li1-2/+1
list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240821065456.2294216-1-lihongbo22@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07buffer: Convert __block_write_begin() to take a folioMatthew Wilcox (Oracle)1-3/+2
Almost all callers have a folio now, so change __block_write_begin() to take a folio and remove a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)1-14/+14
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_end to take a folioMatthew Wilcox (Oracle)1-5/+4
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07buffer: Convert block_write_end() to take a folioMatthew Wilcox (Oracle)1-3/+2
All callers now have a folio, so pass it in instead of converting from a folio to a page and back to a folio again. Saves a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07buffer: Use a folio in generic_write_end()Matthew Wilcox (Oracle)1-3/+4
Replaces two implicit calls to compound_head() with one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert block_write_begin() to use a folioMatthew Wilcox (Oracle)1-9/+10
Use the folio APIs to retrieve the folio from the page cache and manipulate it. Saves a few conversions between pages & folios. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-18Merge tag 'ext4_for_linus-6.11-rc1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many cleanups and bug fixes in ext4, especially for the fast commit feature. Also some performance improvements; in particular, improving IOPS and throughput on fast devices running Async Direct I/O by up to 20% by optimizing jbd2_transaction_committed()" * tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: make sure the first directory block is not a hole ext4: check dot and dotdot of dx_root before making dir indexed ext4: sanity check for NULL pointer after ext4_force_shutdown jbd2: increase maximum transaction size jbd2: drop pointless shrinker batch initialization jbd2: avoid infinite transaction commit loop jbd2: precompute number of transaction descriptor blocks jbd2: make jbd2_journal_get_max_txn_bufs() internal jbd2: avoid mount failed when commit block is partial submitted ext4: avoid writing unitialized memory to disk in EA inodes ext4: don't track ranges in fast_commit if inode has inlined data ext4: fix possible tid_t sequence overflows ext4: use ext4_update_inode_fsync_trans() helper in inode creation ext4: add missing MODULE_DESCRIPTION() jbd2: add missing MODULE_DESCRIPTION() ext4: use memtostr_pad() for s_volume_name jbd2: speed up jbd2_transaction_committed() ext4: make ext4_da_map_blocks() buffer_head unaware ext4: make ext4_insert_delayed_block() insert multi-blocks ext4: factor out a helper to check the cluster allocation state ...
2024-07-08ext4: sanity check for NULL pointer after ext4_force_shutdownWojciech Gładysz1-0/+2
Test case: 2 threads write short inline data to a file. In ext4_page_mkwrite the resulting inline data is converted. Handling ext4_grp_locked_error with description "block bitmap and bg descriptor inconsistent: X vs Y free clusters" calls ext4_force_shutdown. The conversion clears EXT4_STATE_MAY_INLINE_DATA but fails for ext4_destroy_inline_data_nolock and ext4_mark_iloc_dirty due to ext4_forced_shutdown. The restoration of inline data fails for the same reason not setting EXT4_STATE_MAY_INLINE_DATA. Without the flag set a regular process path in ext4_da_write_end follows trying to dereference page folio private pointer that has not been set. The fix calls early return with -EIO error shall the pointer to private be NULL. Sample crash report: Unable to handle kernel paging request at virtual address dfff800000000004 KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000004] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 20274 Comm: syz-executor185 Not tainted 6.9.0-rc7-syzkaller-gfda5695d692c #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 lr : __block_commit_write+0x3c/0x2b0 fs/buffer.c:2160 sp : ffff8000a1957600 x29: ffff8000a1957610 x28: dfff800000000000 x27: ffff0000e30e34b0 x26: 0000000000000000 x25: dfff800000000000 x24: dfff800000000000 x23: fffffdffc397c9e0 x22: 0000000000000020 x21: 0000000000000020 x20: 0000000000000040 x19: fffffdffc397c9c0 x18: 1fffe000367bd196 x17: ffff80008eead000 x16: ffff80008ae89e3c x15: 00000000200000c0 x14: 1fffe0001cbe4e04 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : 0000000000000000 x8 : 0000000000000004 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffdffc397c9c0 x4 : 0000000000000020 x3 : 0000000000000020 x2 : 0000000000000040 x1 : 0000000000000020 x0 : fffffdffc397c9c0 Call trace: __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 block_write_end+0xb4/0x104 fs/buffer.c:2253 ext4_da_do_write_end fs/ext4/inode.c:2955 [inline] ext4_da_write_end+0x2c4/0xa40 fs/ext4/inode.c:3028 generic_perform_write+0x394/0x588 mm/filemap.c:3985 ext4_buffered_write_iter+0x2c0/0x4ec fs/ext4/file.c:299 ext4_file_write_iter+0x188/0x1780 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0x968/0xc3c fs/read_write.c:590 ksys_write+0x15c/0x26c fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __arm64_sys_write+0x7c/0x90 fs/read_write.c:652 __invoke_syscall arch/arm64/kernel/syscall.c:34 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Code: 97f85911 f94002da 91008356 d343fec8 (38796908) ---[ end trace 0000000000000000 ]--- ---------------- Code disassembly (best guess): 0: 97f85911 bl 0xffffffffffe16444 4: f94002da ldr x26, [x22] 8: 91008356 add x22, x26, #0x20 c: d343fec8 lsr x8, x22, #3 * 10: 38796908 ldrb w8, [x8, x25] <-- trapping instruction Reported-by: syzbot+18df508cf00a0598d9a6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18df508cf00a0598d9a6 Link: https://lore.kernel.org/all/000000000000f19a1406109eb5c5@google.com/T/ Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240703070112.10235-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-31buffer: Remove calls to set and clear the folio error flagMatthew Wilcox (Oracle)1-6/+1
The folio error flag is not tested anywhere, so we can stop setting and clearing it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240530202110.2653630-17-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-21Merge tag 'pull-bd_inode-1' of ↵Linus Torvalds1-13/+13
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number
2024-05-05buffer: improve bdev_getblk documentationMatthew Wilcox (Oracle)1-0/+5
Add some more information about the state of the buffer_head returned. Link: https://lkml.kernel.org/r/20240416031754.4076917-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for bforget() and __bforget()Matthew Wilcox (Oracle)1-3/+6
Distinguish these functions from brelse() and __brelse(). Link: https://lkml.kernel.org/r/20240416031754.4076917-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for brelse() and __brelse()Matthew Wilcox (Oracle)1-9/+8
Move the documentation for __brelse() to brelse(), format it as kernel-doc and update it from talking about pages to folios. Link: https://lkml.kernel.org/r/20240416031754.4076917-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: fix __bread and __bread_gfp kernel-docMatthew Wilcox (Oracle)1-13/+22
The extra indentation confused the kernel-doc parser, so remove it. Fix some other wording while I'm here, and advise the user they need to call brelse() on this buffer. __bread_gfp() isn't used directly by filesystems, but the other wrappers for it don't have documentation, so document it accordingly. Link: https://lkml.kernel.org/r/20240416031754.4076917-5-willy@infradead.org Co-developed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for try_to_free_buffers()Matthew Wilcox (Oracle)1-20/+24
The documentation for this function has become separated from it over time; move it to the right place and turn it into kernel-doc. Mild editing of the content to make it more about what the function does, and less about how it does it. Link: https://lkml.kernel.org/r/20240416031754.4076917-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for block_dirty_folio()Matthew Wilcox (Oracle)1-24/+31
Turn the excellent documentation for this function into kernel-doc. Replace 'page' with 'folio' and make a few other minor updates. Link: https://lkml.kernel.org/r/20240416031754.4076917-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-03fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mappingAl Viro1-8/+8
both for ->i_blkbits and both want the address_space in question anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03grow_dev_folio(): we only want ->bd_inode->i_mapping thereAl Viro1-4/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-3-viro@zeniv.linux.org.uk Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro1-1/+1
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-11Merge tag 'vfs-6.9.iomap' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Restore read-write hints in struct bio through the bi_write_hint member for the sake of UFS devices in mobile applications. This can result in up to 40% lower write amplification in UFS devices. The patch series that builds on this will be coming in via the SCSI maintainers (Bart) - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able to map multiple blocks at once as long as they're in the same folio. This reduces CPU usage for buffered write workloads on e.g., xfs on systems with lots of cores (Christoph) - Record processed bytes in iomap_iter() trace event (Kassey) - Extend iomap_writepage_map() trace event after Christoph's ->map_block() changes to map mutliple blocks at once (Zhang) * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) iomap: Add processed for iomap_iter iomap: add pos and dirty_len into trace_iomap_writepage_map block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation iomap: pass the length of the dirty region to ->map_blocks iomap: map multiple blocks at a time iomap: submit ioends immediately iomap: factor out a iomap_writepage_map_block helper iomap: only call mapping_set_error once for each failed bio iomap: don't chain bios iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend iomap: clean up the iomap_alloc_ioend calling convention iomap: move all remaining per-folio logic into iomap_writepage_map iomap: factor out a iomap_writepage_handle_eof helper iomap: move the PF_MEMALLOC check to iomap_writepages iomap: move the io_folios field out of struct iomap_ioend ...
2024-02-27vfs: remove SLAB_MEM_SPREAD flag usageChengming Zhou1-1/+1
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1 (see [1]), so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete explicitly to avoid confusion for users. Here we can just remove all its users, which has no any functional change. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz [1] Link: https://lore.kernel.org/r/20240224135315.830477-1-chengming.zhou@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06block, fs: Restore the per-bio/request data lifetime fieldsBart Van Assche1-4/+8
Restore support for passing data lifetime information from filesystems to block drivers. This patch reverts commit b179c98f7697 ("block: Remove request.write_hint") and commit c75e707fe1aa ("block: remove the per-bio/request write hint"). This patch does not modify the size of struct bio because the new bi_write_hint member fills a hole in struct bio. pahole reports the following for struct bio on an x86_64 system with this patch applied: /* size: 112, cachelines: 2, members: 20 */ /* sum members: 110, holes: 1, sum holes: 2 */ /* last cacheline: 48 bytes */ Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22buffer: Use KMEM_CACHE instead of kmem_cache_create()Kunwu Chan1-6/+2
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240116091137.92375-1-chentao@kylinos.cn Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22fs: Wrong function name in commentAndreas Gruenbacher1-1/+1
This comment refers to function mark_buffer_inode_dirty(), but the function is actually called mark_buffer_dirty_inode(), so fix the comment. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Link: https://lore.kernel.org/r/20240108172040.178173-1-agruenba@redhat.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds1-98/+77
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-05buffer: fix unintended successful returnMatthew Wilcox (Oracle)1-5/+12
If try_to_free_buffers() succeeded and then folio_alloc_buffers() failed, grow_dev_folio() would return success. This would be incorrect; memory allocation failure is supposed to result in a failure. It's a harmless bug; the caller will simply go around the loop one more time and grow_dev_folio() will correctly return a failure that time. But it was an unintended change and looks like a more serious bug than it is. While I'm in here, improve the commentary about why we return success even though we failed. Link: https://lkml.kernel.org/r/20240101093848.2017115-1-willy@infradead.org Fixes: 6d840a18773f ("buffer: return bool from grow_dev_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29fs: remove the bh_end_io argument from __block_write_full_folioMatthew Wilcox (Oracle)1-12/+10
All callers are passing end_buffer_async_write as this argument, so we can hardcode references to it within __block_write_full_folio(). That lets us make end_buffer_async_write() static. Link: https://lkml.kernel.org/r/20231215200245.748418-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29fs: convert block_write_full_page to block_write_full_folioMatthew Wilcox (Oracle)1-9/+7
Convert the function to be compatible with writepage_t so that it can be passed to write_cache_pages() by blkdev. This removes a call to compound_head(). We can also remove the function export as both callers are built-in. Link: https://lkml.kernel.org/r/20231215200245.748418-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: fix more functions for block size > PAGE_SIZEMatthew Wilcox (Oracle)1-21/+6
Both __block_write_full_folio() and block_read_full_folio() assumed that block size <= PAGE_SIZE. Replace the shift with a divide, which is probably cheaper than first calculating the shift. That lets us remove block_size_bits() as these were the last callers. Link: https://lkml.kernel.org/r/20231109210608.2252323-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: handle large folios in __block_write_begin_int()Matthew Wilcox (Oracle)1-10/+7
When __block_write_begin_int() was converted to support folios, we did not expect large folios to be passed to it. With the current work to support large block size storage devices, this will no longer be true so change the checks on 'from' and 'to' to be related to the size of the folio instead of PAGE_SIZE. Also remove an assumption that the block size is smaller than PAGE_SIZE. Link: https://lkml.kernel.org/r/20231109210608.2252323-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: fix various functions for block size > PAGE_SIZEMatthew Wilcox (Oracle)1-5/+5
If i_blkbits is larger than PAGE_SHIFT, we shift by a negative number, which is undefined. It is safe to shift the block left as a block device must be smaller than MAX_LFS_FILESIZE, which is guaranteed to fit in loff_t. Link: https://lkml.kernel.org/r/20231109210608.2252323-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: cast block to loff_t before shifting itMatthew Wilcox (Oracle)1-1/+1
While sector_t is always defined as a u64 today, that hasn't always been the case and it might not always be the same size as loff_t in the future. Link: https://lkml.kernel.org/r/20231109210608.2252323-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: fix grow_buffers() for block size > PAGE_SIZEMatthew Wilcox (Oracle)1-11/+6
We must not shift by a negative number so work in terms of a byte offset to avoid the awkward shift left-or-right-depending-on-sign option. This means we need to use check_mul_overflow() to ensure that a large block number does not result in a wrap. Link: https://lkml.kernel.org/r/20231109210608.2252323-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nathan@kernel.org: add cast in grow_buffers() to avoid a multiplication libcall] Link: https://lkml.kernel.org/r/20231128-avoid-muloti4-grow_buffers-v1-1-bc3d0f0ec483@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: calculate block number inside folio_init_buffers()Matthew Wilcox (Oracle)1-7/+6
The calculation of block from index doesn't work for devices with a block size larger than PAGE_SIZE as we end up shifting by a negative number. Instead, calculate the number of the first block from the folio's position in the block device. We no longer need to pass sizebits to grow_dev_folio(). Link: https://lkml.kernel.org/r/20231109210608.2252323-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29buffer: return bool from grow_dev_folio()Matthew Wilcox (Oracle)1-25/+25
Patch series "More buffer_head cleanups", v2. The first patch is a left-over from last cycle. The rest fix "obvious" block size > PAGE_SIZE problems. I haven't tested with a large block size setup (but I have done an ext4 xfstests run). This patch (of 7): Rename grow_dev_page() to grow_dev_folio() and make it return a bool. Document what that bool means; it's more subtle than it first appears. Also rename the 'failed' label to 'unlock' beacuse it's not exactly 'failed'. It just hasn't succeeded. Link: https://lkml.kernel.org/r/20231109210608.2252323-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-21fs: Rename mapping private membersMatthew Wilcox (Oracle)1-54/+54
It is hard to find where mapping->private_lock, mapping->private_list and mapping->private_data are used, due to private_XXX being a relatively common name for variables and structure members in the kernel. To fit with other members of struct address_space, rename them all to have an i_ prefix. Tested with an allmodconfig build. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-10-25buffer: remove folio_create_empty_buffers()Matthew Wilcox (Oracle)1-10/+3
With all users converted, remove the old create_empty_buffers() and rename folio_create_empty_buffers() to create_empty_buffers(). Link: https://lkml.kernel.org/r/20231016201114.1928083-28-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25buffer: make folio_create_empty_buffers() return a buffer_headMatthew Wilcox (Oracle)1-11/+13
Patch series "Finish the create_empty_buffers() transition", v2. Pankaj recently added folio_create_empty_buffers() as the folio equivalent to create_empty_buffers(). This patch set finishes the conversion by first converting all remaining filesystems to call folio_create_empty_buffers(), then renaming it back to create_empty_buffers(). I took the opportunity to make a few simplifications like making folio_create_empty_buffers() return the head buffer and extracting get_nth_bh() from nilfs2. A few of the patches in this series aren't directly related to create_empty_buffers(), but I saw them while I was working on this and thought they'd be easy enough to add to this series. Compile-tested only, other than ext4. This patch (of 26): Almost all callers want to know the first BH that was allocated for this folio. We already have that handy, so return it. Link: https://lkml.kernel.org/r/20231016201114.1928083-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231016201114.1928083-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18treewide: mark stuff as __ro_after_initAlexey Dobriyan1-2/+2
__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it. [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18buffer: use folio_end_read()Matthew Wilcox (Oracle)1-12/+4
There are two places that we can use this new helper. Link: https://lkml.kernel.org/r/20231004165317.1061855-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04buffer: remove __getblk_gfp()Matthew Wilcox (Oracle)1-25/+11
Inline it into __bread_gfp(). Link: https://lkml.kernel.org/r/20230914150011.843330-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hui Zhu <teawater@antgroup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04buffer: use bdev_getblk() to avoid memory reclaim in readahead pathMatthew Wilcox (Oracle)1-1/+3
__getblk() adds __GFP_NOFAIL, which is unnecessary for readahead; we're quite comfortable with the possibility that we may not get a bh back. Switch to bdev_getblk() which does not include __GFP_NOFAIL. Link: https://lkml.kernel.org/r/20230914150011.843330-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hui Zhu <teawater@antgroup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04buffer: hoist GFP flags from grow_dev_page() to __getblk_gfp()Matthew Wilcox (Oracle)1-21/+40
grow_dev_page() is only called by grow_buffers(). grow_buffers() is only called by __getblk_slow() and __getblk_slow() is only called from __getblk_gfp(), so it is safe to move the GFP flags setting all the way up. With that done, add a new bdev_getblk() entry point that leaves the GFP flags the way the caller specified them. [willy@infradead.org: fix grow_dev_page() error handling] Link: https://lkml.kernel.org/r/ZRREEIwqiy5DijKB@casper.infradead.org Link: https://lkml.kernel.org/r/20230914150011.843330-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hui Zhu <teawater@antgroup.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04buffer: pass GFP flags to folio_alloc_buffers()Matthew Wilcox (Oracle)1-8/+9
Patch series "Add and use bdev_getblk()", v2. This patch series fixes a bug reported by Hui Zhu; see proposed patches v1 and v2: https://lore.kernel.org/linux-fsdevel/20230811035705.3296-1-teawaterz@linux.alibaba.com/ https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/ I decided to go in a rather different direction for this fix, and fix a related problem at the same time. I don't think there's any urgency to rush this into Linus' tree, nor have I marked it for stable. Reasonable people may disagree. This patch (of 8): Instead of creating entirely new flags, inherit them from grow_dev_page(). The other callers create the same flags that this function used to create. Link: https://lkml.kernel.org/r/20230914150011.843330-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230914150011.843330-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hui Zhu <teawater@antgroup.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-25iomap: add a workaround for racy i_size updates on block devicesChristoph Hellwig1-1/+10
A szybot reproducer that does write I/O while truncating the size of a block device can end up in clean_bdev_aliases, which tries to clean the bdev aliases that it uses. This is because iomap_to_bh automatically sets the BH_New flag when outside of i_size. For block devices updates to i_size are racy and we can hit this case in a tiny race window, leading to the eventual clean_bdev_aliases call. Fix this by erroring out of > i_size I/O on block devices. Reported-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-09-12iomap: handle error conditions more gracefully in iomap_to_bhChristoph Hellwig1-11/+14
iomap_to_bh currently BUG()s when the passed in block number is not in the iomap. For file systems that have proper synchronization this should never happen and so far hasn't in mainline, but for block devices size changes aren't fully synchronized against ongoing I/O. Instead of BUG()ing in this case, return -EIO to the caller, which already has proper error handling. While we're at it, also return -EIO for an unknown iomap state instead of returning garbage. Fixes: 487c607df790 ("block: use iomap for writes to block devices") Reported-by: syzbot+4a08ffdf3667b36650a1@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds1-6/+0
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds1-28/+8
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds1-8/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-18buffer: remove set_bh_page()Matthew Wilcox (Oracle)1-15/+0
With all users converted to folio_set_bh(), remove this function. Link: https://lkml.kernel.org/r/20230713035512.4139457-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18fs: convert block_commit_write to return voidBean Huo1-8/+6
block_commit_write() always returns 0, this patch changes it to return void. Link: https://lkml.kernel.org/r/20230626055518.842392-3-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18fs/buffer: clean up block_commit_writeBean Huo1-7/+4
Originally inode is used to get blksize, after commit 45bce8f3e343 ("fs/buffer.c: make block-size be per-page and protected by the page lock"), __block_commit_write no longer uses this parameter inode. [akpm@linux-foundation.org: remove now-unused local `inode'] Link: https://lkml.kernel.org/r/20230626055518.842392-2-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-15fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUsMarcelo Tosatti1-1/+6
For certain types of applications (for example PLC software or RAN processing), upon occurrence of an event, it is necessary to complete a certain task in a maximum amount of time (deadline). One way to express this requirement is with a pair of numbers, deadline time and execution time, where: * deadline time: length of time between event and deadline. * execution time: length of time it takes for processing of event to occur on a particular hardware platform (uninterrupted). The particular values depend on use-case. For the case where the realtime application executes in a virtualized guest, an IPI which must be serviced in the host will cause the following sequence of events: 1) VM-exit 2) execution of IPI (and function call) 3) VM-entry Which causes an excess of 50us latency as observed by cyclictest (this violates the latency requirement of vRAN application with 1ms TTI, for example). invalidate_bh_lrus calls an IPI on each CPU that has non empty per-CPU cache: on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1); The performance when using the per-CPU LRU cache is as follows: 42 ns per __find_get_block 68 ns per __find_get_block_slow Given that the main use cases for latency sensitive applications do not involve block I/O (data necessary for program operation is locked in RAM), disable per-CPU buffer_head caches for isolated CPUs. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Message-Id: <ZJtBrybavtb1x45V@tpad> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09fs: stop using bdev->bd_super in mark_buffer_write_io_errorChristoph Hellwig1-8/+3
bdev->bd_super is a somewhat awkward backpointer from a block device to an owning file system with unclear rules. For the buffer_head code we already have a good backpointer for the inode that the buffer_head is associated with, even if it lives on the block device mapping: b_assoc_map. It is used track dirty buffers associated with an inode but living on the block device mapping like directory buffers in ext4. mark_buffer_write_io_error already uses it for the call to mapping_set_error, and should be doing the same for the per-sb error sequence. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230807112625.652089-2-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-02fs: remove emergency_thaw_bdevChristoph Hellwig1-6/+0
Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c to be built only when buffer_head support is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-29Merge tag 'fs_for_v6.5-rc1' of ↵Linus Torvalds1-0/+70
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem updates from Jan Kara: - Rewrite kmap_local() handling in ext2 - Convert ext2 direct IO path to iomap (with some infrastructure tweaks associated with that) - Convert two boilerplate licenses in udf to SPDX identifiers - Other small udf, ext2, and quota fixes and cleanups * tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix uninitialized array access for some pathnames ext2: Drop fragment support quota: fix warning in dqgrab() quota: Properly disable quotas when add_dquot_ref() fails fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier fs: Drop wait_unfrozen wait queue ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore ext2_{set_link,delete_entry}(): don't bother with page_addr ext2_put_page(): accept any pointer within the page ext2_get_page(): saner type ext2: use offset_in_page() instead of open-coding it as subtraction ext2_rename(): set_link and delete_entry may fail ext2: Add direct-io trace points ext2: Move direct-io to use iomap ext2: Use generic_buffers_fsync() implementation ext4: Use generic_buffers_fsync_noflush() implementation fs/buffer.c: Add generic_buffers_fsync*() implementation ext2/dax: Fix ext2_setsize when len is page aligned
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-132/+127
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-2/+1
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-19buffer: convert block_truncate_page() to use a folioMatthew Wilcox (Oracle)1-14/+16
Support large folios in block_truncate_page() and avoid three hidden calls to compound_head(). [willy@infradead.org: fix check of filemap_grab_folio() return value] Link: https://lkml.kernel.org/r/ZItZOt+XxV12HtzL@casper.infradead.org Link: https://lkml.kernel.org/r/20230612210141.730128-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: use a folio in __find_get_block_slow()Matthew Wilcox (Oracle)1-6/+6
Saves a call to compound_head() and may be needed to support block size > PAGE_SIZE. Link: https://lkml.kernel.org/r/20230612210141.730128-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert link_dev_buffers to take a folioMatthew Wilcox (Oracle)1-4/+4
Its one caller already has a folio, so switch it to use the folio API. Removes a hidden call to compound_head(). Link: https://lkml.kernel.org/r/20230612210141.730128-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert init_page_buffers() to folio_init_buffers()Matthew Wilcox (Oracle)1-10/+8
Use the folio API and pass the folio from both callers. Saves a hidden call to compound_head(). Link: https://lkml.kernel.org/r/20230612210141.730128-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert grow_dev_page() to use a folioMatthew Wilcox (Oracle)1-19/+15
Get a folio from the page cache instead of a page, then use the folio API throughout. Removes a few calls to compound_head() and may be needed to support block size > PAGE_SIZE. Link: https://lkml.kernel.org/r/20230612210141.730128-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert page_zero_new_buffers() to folio_zero_new_buffers()Matthew Wilcox (Oracle)1-13/+14
Most of the callers already have a folio; convert reiserfs_write_end() to have a folio. Removes a couple of hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20230612210141.730128-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert __block_commit_write() to take a folioMatthew Wilcox (Oracle)1-19/+19
This removes a hidden call to compound_head() inside __block_commit_write() and moves it to those callers which are still page based. Also make block_write_end() safe for large folios. Link: https://lkml.kernel.org/r/20230612210141.730128-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert block_page_mkwrite() to use a folioMatthew Wilcox (Oracle)1-14/+13
If any page in a folio is dirtied, dirty the entire folio. Removes a number of hidden calls to compound_head() and references to page->mapping and page->index. Fixes a pre-existing bug where we could mark a folio as dirty if the file is truncated to a multiple of the page size just as we take the page fault. I don't believe this bug has any bad effect, it's just inefficient. Link: https://lkml.kernel.org/r/20230612210141.730128-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: make block_write_full_page() handle large folios correctlyMatthew Wilcox (Oracle)1-12/+10
Keep the interface as struct page, but work entirely on the folio internally. Removes several PAGE_SIZE assumptions and removes some references to page->index and page->mapping. Link: https://lkml.kernel.org/r/20230612210141.730128-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19buffer: convert __block_write_full_page() to __block_write_full_folio()Matthew Wilcox (Oracle)1-26/+27
Remove nine hidden calls to compound_head() by using a folio instead of a page. Link: https://lkml.kernel.org/r/20230612210141.730128-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-08fs: unexport buffer_check_dirty_writebackChristoph Hellwig1-1/+0
buffer_check_dirty_writeback is only used by the block device aops, remove the export. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230608122958.276954-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-31fs: buffer: use __bio_add_page to add single page to bioJohannes Thumshirn1-2/+1
The buffer_head submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Reviewed-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/84ff2dcbe81b258a73ad900adb5266e208b61a4d.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16fs/buffer.c: Add generic_buffers_fsync*() implementationRitesh Harjani (IBM)1-0/+70
Some of the higher layers like iomap takes inode_lock() when calling generic_write_sync(). Also writeback already happens from other paths without inode lock, so it's difficult to say that we really need sync_mapping_buffers() to take any inode locking here. Having said that, let's add generic_buffers_fsync/_noflush() implementation in buffer.c with no inode_lock/unlock() for now so that filesystems like ext2 and ext4's nojournal mode can use it. Ext4 when got converted to iomap for direct-io already copied it's own variant of __generic_file_fsync() without lock. This patch adds generic_buffers_fsync() & generic_buffers_fsync_noflush() implementations for use in filesystems like ext2 & ext4 respectively. Later we can review other filesystems as well to see if we can make generic_buffers_fsync/_noflush() which does not take any inode_lock() as the default path. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <d573408ac8408627d23a3d2d166e748c172c4c9e.1682069716.git.ritesh.list@gmail.com>
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-29/+60
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-26Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds1-5/+4
Pull fsverity updates from Eric Biggers: "Several cleanups and fixes for fs/verity/, including a couple minor fixes to the changes in 6.3 that added support for Merkle tree block sizes less than the page size" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: reject FS_IOC_ENABLE_VERITY on mode 3 fds fsverity: explicitly check for buffer overflow in build_merkle_tree() fsverity: use WARN_ON_ONCE instead of WARN_ON fs-verity: simplify sysctls with register_sysctl() fs/buffer.c: use b_folio for fsverity work
2023-04-26Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds1-2/+2
Pull fscrypt updates from Eric Biggers: "A few cleanups for fs/crypto/, and another patch to prepare for the upcoming CephFS encryption support" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: optimize fscrypt_initialize() fscrypt: use WARN_ON_ONCE instead of WARN_ON fscrypt: new helper function - fscrypt_prepare_lookup_partial() fs/buffer.c: use b_folio for fscrypt work
2023-04-21fs/buffer: convert create_page_buffers to folio_create_buffersPankaj Raghav1-10/+13
fs/buffer do not support large folios as there are many assumptions on the folio size to be the host page size. This conversion is one step towards removing that assumption. Also this conversion will reduce calls to compound_head() if folio_create_buffers() calls folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21fs/buffer: add folio_create_empty_buffers helperPankaj Raghav1-11/+17
Folio version of create_empty_buffers(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. It removes several calls to compound_head() as it works directly on folio compared to create_empty_buffers(). Hence, create_empty_buffers() has been modified to call folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21buffer: add folio_alloc_buffers() helperPankaj Raghav1-8/+15
Folio version of alloc_page_buffers() helper. This is required to convert create_page_buffers() to folio_create_buffers() later in the series. alloc_page_buffers() has been modified to call folio_alloc_buffers() which adds one call to compound_head() but folio_alloc_buffers() removes one call to compound_head() compared to the existing alloc_page_buffers() implementation. Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21fs/buffer: add folio_set_bh helperPankaj Raghav1-0/+15
Patch series "convert create_page_buffers to folio_create_buffers". One of the first kernel panic we hit when we try to increase the block size > 4k is inside create_page_buffers()[1]. Even though buffer.c function do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint. This patch (of 4): The folio version of set_bh_page(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-27fs/buffer.c: use b_folio for fsverity workEric Biggers1-5/+4
Use b_folio now that it exists. This removes an unnecessary call to compound_head(). No actual change in behavior. Link: https://lore.kernel.org/r/20230224232530.98440-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-03-27fs/buffer.c: use b_folio for fscrypt workEric Biggers1-2/+2
Use b_folio now that it exists. This removes an unnecessary call to compound_head(). No actual change in behavior. Link: https://lore.kernel.org/r/20230224232503.98372-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-03-27fs/buffer: Remove redundant assignment to errJiapeng Chong1-6/+3
Variable 'err' set but not used. fs/buffer.c:2613:2: warning: Value stored to 'err' is never read. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4589 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds1-27/+27
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-01-28fscrypt: support decrypting data from large foliosEric Biggers1-2/+2
Try to make the filesystem-level decryption functions in fs/crypto/ aware of large folios. This includes making fscrypt_decrypt_bio() support the case where the bio contains large folios, and making fscrypt_decrypt_pagecache_blocks() take a folio instead of a page. There's no way to actually test this with large folios yet, but I've tested that this doesn't cause any regressions. Note that this patch just handles *decryption*, not encryption which will be a little more difficult. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230127224202.355629-1-ebiggers@kernel.org
2023-01-27fsverity: support verifying data from large foliosEric Biggers1-1/+2
Try to make fs/verity/verify.c aware of large folios. This includes making fsverity_verify_bio() support the case where the bio contains large folios, and adding a function fsverity_verify_folio() which is the equivalent of fsverity_verify_page(). There's no way to actually test this with large folios yet, but I've tested that this doesn't cause any regressions. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230127221529.299560-1-ebiggers@kernel.org
2023-01-18buffer: use b_folio in mark_buffer_dirty()Matthew Wilcox (Oracle)1-6/+6
Removes about four calls to compound_head(). Two of them are inline which removes 132 bytes from the kernel text. Link: https://lkml.kernel.org/r/20221215214402.3522366-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18buffer: use b_folio in end_buffer_async_write()Matthew Wilcox (Oracle)1-5/+5
Save 76 bytes from avoiding the call to compound_head() in SetPageError(). Also avoid the call to compound_head() in end_page_writeback(). Link: https://lkml.kernel.org/r/20221215214402.3522366-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18buffer: use b_folio in end_buffer_async_read()Matthew Wilcox (Oracle)1-9/+9
Removes a call to compound_head() in SetPageError(), saving 76 bytes of text. Link: https://lkml.kernel.org/r/20221215214402.3522366-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18buffer: use b_folio in touch_buffer()Matthew Wilcox (Oracle)1-1/+1
Removes a call to compound_head() in this path. Link: https://lkml.kernel.org/r/20221215214402.3522366-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18buffer: replace obvious uses of b_page with b_folioMatthew Wilcox (Oracle)1-6/+6
These cases just check if it's NULL, or use b_page to get to the page's address space. They are assumptions that b_page never points to a tail page. Link: https://lkml.kernel.org/r/20221215214402.3522366-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-09fs/buffer.c: support fsverity in block_read_full_folio()Eric Biggers1-10/+57
After each filesystem block (as represented by a buffer_head) has been read from disk by block_read_full_folio(), verify it if needed. The verification is done on the fsverity_read_workqueue. Also allow reads of verity metadata past i_size, as required by ext4. This is needed to support fsverity on ext4 filesystems where the filesystem block size is less than the page size. The new code is compiled away when CONFIG_FS_VERITY=n. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20221223203638.41293-11-ebiggers@kernel.org
2022-10-12Merge tag 'mm-nonmm-stable-2022-10-11' of ↵Linus Torvalds1-9/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds1-93/+65
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-03mm: fs: initialize fsdata passed to write_begin/write_end interfaceAlexander Potapenko1-2/+2
Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-29fs/buffer: make submit_bh & submit_bh_wbc return type as voidRitesh Harjani (IBM)1-7/+6
submit_bh/submit_bh_wbc are non-blocking functions which just submit the bio and return. The caller of submit_bh/submit_bh_wbc needs to wait on buffer till I/O completion and then check buffer head's b_state field to know if there was any I/O error. Hence there is no need for these functions to have any return type. Even now they always returns 0. Hence drop the return value and make their return type as void to avoid any confusion. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/cb66ef823374cdd94d2d03083ce13de844fffd41.1660788334.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29fs/buffer: drop useless return value of submit_bhRitesh Harjani (IBM)1-6/+4
submit_bh always returns 0. This patch drops the useless return value of submit_bh from __sync_dirty_buffer(). Once all of submit_bh callers are cleaned up, we can make it's return type as void. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/a98a6ddfac68f73d684c2724952e825bc1f4d238.1660788334.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-11buffer: use try_cmpxchg in discard_bufferUros Bizjak1-9/+5
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in discard_buffer. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Link: https://lkml.kernel.org/r/20220714171653.12128-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fs/buffer: remove bh_submit_read() helperZhang Yi1-25/+0
bh_submit_read() has no user anymore, just remove it. Link: https://lkml.kernel.org/r/20220901133505.2510834-15-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fs/buffer: remove ll_rw_block() helperZhang Yi1-59/+4
Now that all ll_rw_block() users has been replaced to new safe helpers, we just remove it here. Link: https://lkml.kernel.org/r/20220901133505.2510834-13-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fs/buffer: replace ll_rw_block()Zhang Yi1-7/+5
ll_rw_block() is not safe for the sync IO path because it skip buffers which has been locked by others, it could lead to false positive EIO when submitting read IO. So stop using ll_rw_block(), switch to use new helpers which could guarantee buffer locked and submit IO if needed. Link: https://lkml.kernel.org/r/20220901133505.2510834-4-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fs/buffer: add some new buffer read helpersZhang Yi1-0/+65
Current ll_rw_block() helper is fragile because it assumes that locked buffer means it's under IO which is submitted by some other who holds the lock, it skip buffer if it failed to get the lock, so it's only safe on the readahead path. Unfortunately, now that most filesystems still use this helper mistakenly on the sync metadata read path. There is no guarantee that the one who holds the buffer lock always submit IO (e.g. buffer_migrate_folio_norefs() after commit 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"), it could lead to false positive -EIO when submitting reading IO. This patch add some friendly buffer read helpers to prepare replacing ll_rw_block() and similar calls. We can only call bh_readahead_[] helpers for the readahead paths. Link: https://lkml.kernel.org/r/20220901133505.2510834-3-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11fs/buffer: remove __breadahead_gfp()Zhang Yi1-11/+0
Patch series "fs/buffer: remove ll_rw_block()", v2. ll_rw_block() will skip locked buffer before submitting IO, it assumes that locked buffer means it is under IO. This assumption is not always true because we cannot guarantee every buffer lock path would submit IO. After commit 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"), buffer_migrate_folio_norefs() becomes one exceptional case, and there may be others. So ll_rw_block() is not safe on the sync read path, we could get false positive EIO return value when filesystem reading metadata. It seems that it could be only used on the readahead path. Unfortunately, many filesystem misuse the ll_rw_block() on the sync read path. This patch set just remove ll_rw_block() and add new friendly helpers, which could prevent false positive EIO on the read metadata path. Thanks for the suggestion from Jan, the original discussion is at [1]. patch 1: remove unused helpers in fs/buffer.c patch 2: add new bh_read_[*] helpers patch 3-11: remove all ll_rw_block() calls in filesystems patch 12-14: do some leftover cleanups. [1]. https://lore.kernel.org/linux-mm/20220825080146.2021641-1-chengzhihao1@huawei.com/ This patch (of 14): No one use __breadahead_gfp() and sb_breadahead_unmovable() any more, remove them. Link: https://lkml.kernel.org/r/20220901133505.2510834-1-yi.zhang@huawei.com Link: https://lkml.kernel.org/r/20220901133505.2510834-2-yi.zhang@huawei.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Heming Zhao <ocfs2-devel@oss.oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-03Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-342/+21
Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call ->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...
2022-08-02fs: remove the nobh helpersChristoph Hellwig1-324/+0
All callers are gone, so remove the now dead code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02buffer: Don't test folio error in block_read_full_folio()Matthew Wilcox (Oracle)1-2/+5
We can cache this information in a local variable instead of communicating from one part of the function to another via folio flags. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-07-15fs/buffer: Fix the ll_rw_block() kernel-doc headerBart Van Assche1-3/+2
Bring the ll_rw_block() kernel-doc header again in sync with the function prototype. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 1420c4a549bf ("fs/buffer: Combine two submit_bh() and ll_rw_block() arguments") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220715184735.2326034-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14fs/buffer: Combine two submit_bh() and ll_rw_block() argumentsBart Van Assche1-26/+27
Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14fs/buffer: Use the new blk_opf_t typeBart Van Assche1-10/+11
Improve static type checking by using the new blk_opf_t type for block layer request flags. Change WRITE into REQ_OP_WRITE. This patch does not change any functionality since REQ_OP_WRITE == WRITE == 1. Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-47-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-29buffer: Remove check for PageErrorMatthew Wilcox (Oracle)1-3/+3
If a buffer is completed with an error, its uptodate flag will be clear, so the page_uptodate variable will have been set to 0. There's no need to check PageError here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29buffer: Convert clean_bdev_aliases() to use filemap_get_folios()Matthew Wilcox (Oracle)1-13/+13
Use a folio throughout this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-05-09fs: Convert drop_buffers() to use a folioMatthew Wilcox (Oracle)1-8/+8
All callers now have a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09fs: Change try_to_free_buffers() to take a folioMatthew Wilcox (Oracle)1-21/+21
All but two of the callers already have a folio; pass a folio into try_to_free_buffers(). This removes the last user of cancel_dirty_page() so remove that wrapper function too. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09mm,fs: Remove aops->readpageMatthew Wilcox (Oracle)1-4/+1
With all implementations of aops->readpage converted to aops->read_folio, we can stop checking whether it's set and remove the member from aops. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09fs: Convert block_read_full_page() to block_read_full_folio()Matthew Wilcox (Oracle)1-25/+28
This function is NOT converted to handle large folios, so include an assert that the filesystem isn't passing one in. Otherwise, use the folio functions instead of the page functions, where they exist. Convert all filesystems which use block_read_full_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09fs: Introduce aops->read_folioMatthew Wilcox (Oracle)1-1/+4
Change all the callers of ->readpage to call ->read_folio in preference, if it exists. This is a transitional duplication, and will be removed by the end of the series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-08buffer: Rewrite nobh_truncate_page() to use foliosMatthew Wilcox (Oracle)1-37/+27
- Calculate iblock directly instead of using a while loop - Move has_buffers to the end to remove a backwards jump - Use __filemap_get_folio() instead of grab_cache_page(), which removes a spurious FGP_ACCESSED flag. - Eliminate length and pos variables - Use folio APIs where they exist Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08fs: Convert is_dirty_writeback() to take a folioMatthew Wilcox (Oracle)1-8/+8
Pass a folio instead of a page to aops->is_dirty_writeback(). Convert both implementations and the caller. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08buffer: Call aops write_begin() and write_end() directlyMatthew Wilcox (Oracle)1-6/+8
pagecache_write_begin() and pagecache_write_end() are now trivial wrappers, so call the aops directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08fs: Remove aop flags parameter from nobh_write_begin()Matthew Wilcox (Oracle)1-2/+1
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08fs: Remove aop flags parameter from grab_cache_page_write_begin()Matthew Wilcox (Oracle)1-2/+2
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08fs: Remove aop flags parameter from cont_write_begin()Matthew Wilcox (Oracle)1-1/+1
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08fs: Remove aop flags parameter from block_write_begin()Matthew Wilcox (Oracle)1-3/+3
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-01filemap: Remove AOP_FLAG_CONT_EXPANDMatthew Wilcox (Oracle)1-2/+1
This flag is no longer used, so remove it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-8/+5
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-51/+45
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-3/+5
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22mm: fs: fix lru_cache_disabled race in bh_lruMinchan Kim1-3/+5
Check lru_cache_disabled under bh_lru_lock. Otherwise, it could introduce race below and it fails to migrate pages containing buffer_head. CPU 0 CPU 1 bh_lru_install lru_cache_disable lru_cache_disabled = false atomic_inc(&lru_disable_count); invalidate_bh_lrus_cpu of CPU 0 bh_lru_lock __invalidate_bh_lrus bh_lru_unlock bh_lru_lock install the bh bh_lru_unlock WHen this race happens a CMA allocation fails, which is critical for the workload which depends on CMA. Link: https://lkml.kernel.org/r/20220308180709.2017638-1-minchan@kernel.org Fixes: 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration") Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-16fs: Convert __set_page_dirty_buffers to block_dirty_folioMatthew Wilcox (Oracle)1-18/+15
Convert all callers; mostly this is just changing the aops to point at it, but a few implementations need a little more work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15fs: Turn block_invalidatepage into block_invalidate_folioMatthew Wilcox (Oracle)1-19/+18
Remove special-casing of a NULL invalidatepage, since there is no more block_invalidatepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14fs: Convert is_partially_uptodate to foliosMatthew Wilcox (Oracle)1-14/+12
Since the uptodate property is maintained on a per-folio basis, the is_partially_uptodate method should also take a folio. Fix the types at the same time so it's clear that it returns true/false and takes the count in bytes, not blocks. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig1-8/+5
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02block: pass a block_device and opf to bio_allocChristoph Hellwig1-8/+6
Pass the block_device and operation that we plan to use this bio for to bio_alloc to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16fs/buffer: Convert __block_write_begin_int() to take a folioMatthew Wilcox (Oracle)1-11/+12
There are no plans to convert buffer_head infrastructure to use large folios, but __block_write_begin_int() is called from iomap, and it's more convenient and less error-prone if we pass in a folio from iomap. It also has a nice saving of almost 200 bytes of code from removing repeated calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-18fs: simplify init_page_buffersChristoph Hellwig1-1/+1
No need to convert from bdev to inode and back. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211018101130.1838532-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18fs: use bdev_nr_bytes instead of open coding it in blkdev_max_blockChristoph Hellwig1-1/+1
Use the proper helper to read the block device size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20211018101130.1838532-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24mm: fs: invalidate bh_lrus for only cold pathMinchan Kim1-2/+6
The kernel test robot reported the regression of fio.write_iops[1] with commit 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration"). Since lru_add_drain is called frequently, invalidate bh_lrus there could increase bh_lrus cache miss ratio, which needs more IO in the end. This patch moves the bh_lrus invalidation from the hot path( e.g., zap_page_range, pagevec_release) to cold path(i.e., lru_add_drain_all, lru_cache_disable). Zhengjun Xing confirmed "I test the patch, the regression reduced to -2.9%" [1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/ [2] 8cc621d2f45d, mm: fs: invalidate BH LRU during page migration Link: https://lkml.kernel.org/r/20210907212347.1977686-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org> Tested-by: "Xing, Zhengjun" <zhengjun.xing@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-01Merge branch 'exit-cleanups-for-v5.15' of ↵Linus Torvalds1-27/+0
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exit cleanups from Eric Biederman: "In preparation of doing something about PTRACE_EVENT_EXIT I have started cleaning up various pieces of code related to do_exit. Most of that code I did not manage to get tested and reviewed before the merge window opened but a handful of very useful cleanups are ready to be merged. The first change is simply the removal of the bdflush system call. The code has now been disabled long enough that even the oldest userspace working userspace setups anyone can find to test are fine with the bdflush system call being removed. Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of calling do_exit directly is interesting only in that it is nearly the most difficult of the incorrect uses of do_exit to remove. The change to the seccomp code to simply send a signal instead of calling do_coredump directly is a very nice little cleanup made possible by realizing the existing signal sending helpers were missing a little bit of functionality that is easy to provide" * 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal/seccomp: Dump core when there is only one live thread signal/seccomp: Refactor seccomp signal and coredump generation signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die exit/bdflush: Remove the deprecated bdflush system call
2021-08-16fs: mark the iomap argument to __block_write_begin_int constChristoph Hellwig1-2/+2
__block_write_begin_int never modifies the passed in iomap, so mark it const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-12exit/bdflush: Remove the deprecated bdflush system callEric W. Biederman1-27/+0
The bdflush system call has been deprecated for a very long time. Recently Michael Schmitz tested[1] and found that the last known caller of of the bdflush system call is unaffected by it's removal. Since the code is not needed delete it. [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com Link: https://lkml.kernel.org/r/87sg10quue.fsf_-_@disp2133 Tested-by: Michael Schmitz <schmitzmic@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-06-29mm/writeback: move __set_page_dirty() to core mmMatthew Wilcox (Oracle)1-24/+0
Patch series "Further set_page_dirty cleanups". Prompted by Christoph's recent patches, here are some more patches to improve the state of set_page_dirty(). They're all from the folio tree, so they've been tested to a certain extent. This patch (of 6): Nothing in __set_page_dirty() is specific to buffer_head, so move it to mm/page-writeback.c. That removes the only caller of account_page_dirtied() outside of page-writeback.c, so make it static. Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29fs: unexport __set_page_dirtyChristoph Hellwig1-1/+0
Patch series "remove the implicit .set_page_dirty default". This series cleans up a few lose ends around ->set_page_dirty, most importantly removes the default to the buffer head based on if no method is wired up. This patch (of 3): __set_page_dirty is only used by built-in code. Link: https://lkml.kernel.org/r/20210614061512.3966143-1-hch@lst.de Link: https://lkml.kernel.org/r/20210614061512.3966143-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-6/+30
Merge more updates from Andrew Morton: "The remainder of the main mm/ queue. 143 patches. Subsystems affected by this patch series (all mm): pagecache, hugetlb, userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap, kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and kfence" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (143 commits) kfence: use power-efficient work queue to run delayed work kfence: maximize allocation wait timeout duration kfence: await for allocation using wait_event kfence: zero guard page after out-of-bounds access mm/process_vm_access.c: remove duplicate include mm/mempool: minor coding style tweaks mm/highmem.c: fix coding style issue btrfs: use memzero_page() instead of open coded kmap pattern iov_iter: lift memzero_page() to highmem.h mm/zsmalloc: use BUG_ON instead of if condition followed by BUG. mm/zswap.c: switch from strlcpy to strscpy arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE mm,memory_hotplug: add kernel boot option to enable memmap_on_memory acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported mm,memory_hotplug: allocate memmap from the added memory range mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count() mm,memory_hotplug: relax fully spanned sections check drivers/base/memory: introduce memory_block_{online,offline} mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove ...
2021-05-05mm: fs: invalidate BH LRU during page migrationMinchan Kim1-6/+30
Pages containing buffer_heads that are in one of the per-CPU buffer_head LRU caches will be pinned and thus cannot be migrated. This can prevent CMA allocations from succeeding, which are often used on platforms with co-processors (such as a DSP) that can only use physically contiguous memory. It can also prevent memory hot-unplugging from succeeding, which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1 GiB based on the architecture in use. Correspondingly, invalidate the BH LRU caches before a migration starts and stop any buffer_head from being cached in the LRU caches, until migration has finished. Link: https://lkml.kernel.org/r/20210319175127.886124-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Chris Goldsworthy <cgoldswo@codeaurora.org> Reported-by: Laura Abbott <labbott@kernel.org> Tested-by: Oliver Sang <oliver.sang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Dias <joaodias@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-22buffer: a small optimization in grow_buffersMikulas Patocka1-5/+1
This patch replaces a loop with a "tzcnt" instruction. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-02-24fs: buffer: use raw page_memcg() on locked pageJohannes Weiner1-2/+2
alloc_page_buffers() currently uses get_mem_cgroup_from_page() for charging the buffers to the page owner, which does an rcu-protected page->memcg lookup and acquires a reference. But buffer allocation has the page lock held throughout, which pins the page to the memcg and thereby the memcg - neither rcu nor holding an extra reference during the allocation are necessary. Use a raw page_memcg() instead. This was the last user of get_mem_cgroup_from_page(), delete it. Link: https://lkml.kernel.org/r/20210209190126.97842-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24fs/buffer.c: add checking buffer head stat before clearYang Guo1-1/+2
clear_buffer_new() is used to clear buffer new stat. When PAGE_SIZE is 64K, most buffer heads in the list are not needed to clear. clear_buffer_new() has an enpensive atomic modification operation, Let's add checking buffer head before clear it as __block_write_begin_int does which is good for performance. Link: https://lkml.kernel.org/r/1612332890-57918-1-git-send-email-zhangshaokun@hisilicon.com Signed-off-by: Yang Guo <guoyang2@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-02mm: memcontrol: Use helpers to read page's memcg dataRoman Gushchin1-1/+1
Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-01fs: simplify freeze_bdev/thaw_bdevChristoph Hellwig1-1/+1
Store the frozen superblock in struct block_device to avoid the awkward interface that can return a sb only used a cookie, an ERR_PTR or NULL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-18mm, memcg: rework remote charging API to support nestingRoman Gushchin1-3/+3
Currently the remote memcg charging API consists of two functions: memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the memcg value, which overwrites the memcg of the current task. memalloc_use_memcg(target_memcg); <...> memalloc_unuse_memcg(); It works perfectly for allocations performed from a normal context, however an attempt to call it from an interrupt context or just nest two remote charging blocks will lead to an incorrect accounting. On exit from the inner block the active memcg will be cleared instead of being restored. memalloc_use_memcg(target_memcg); memalloc_use_memcg(target_memcg_2); <...> memalloc_unuse_memcg(); Error: allocation here are charged to the memcg of the current process instead of target_memcg. memalloc_unuse_memcg(); This patch extends the remote charging API by switching to a single function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg), which sets the new value and returns the old one. So a remote charging block will look like: old_memcg = set_active_memcg(target_memcg); <...> set_active_memcg(old_memcg); This patch is heavily based on the patch by Johannes Weiner, which can be found here: https://lkml.org/lkml/2020/5/28/806 . Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-07fs: Don't invalidate page buffers in block_write_full_page()Jan Kara1-16/+0
If block_write_full_page() is called for a page that is beyond current inode size, it will truncate page buffers for the page and return 0. This logic has been added in 2.5.62 in commit 81eb69062588 ("fix ext3 BUG due to race with truncate") in history.git tree to fix a problem with ext3 in data=ordered mode. This particular problem doesn't exist anymore because ext3 is long gone and ext4 handles ordered data differently. Also normally buffers are invalidated by truncate code and there's no need to specially handle this in ->writepage() code. This invalidation of page buffers in block_write_full_page() is causing issues to filesystems (e.g. ext4 or ocfs2) when block device is shrunk under filesystem's hands and metadata buffers get discarded while being tracked by the journalling layer. Although it is obviously "not supported" it can cause kernel crashes like: [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at +0000000000000008 [ 7986.697197] PGD 0 P4D 0 [ 7986.699724] Oops: 0002 [#1] SMP PTI [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G +O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1 [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015 [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2] ... [ 7986.810150] Call Trace: [ 7986.812595] __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2] [ 7986.818408] jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2] [ 7986.836467] kjournald2+0xbd/0x270 [jbd2] which is not great. The crash happens because bh->b_private is suddently NULL although BH_JBD flag is still set (this is because block_invalidatepage() cleared BH_Mapped flag and subsequent bh lookup found buffer without BH_Mapped set, called init_page_buffers() which has rewritten bh->b_private). So just remove the invalidation in block_write_full_page(). Note that the buffer cache invalidation when block device changes size is already careful to avoid similar problems by using invalidate_mapping_pages() which skips busy buffers so it was only this odd block_write_full_page() behavior that could tear down bdev buffers under filesystem's hands. Reported-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improvements to ext4's block allocator performance for very large file systems, especially when the file system or files which are highly fragmented. There is a new mount option, prefetch_block_bitmaps which will pull in the block bitmaps and set up the in-memory buddy bitmaps when the file system is initially mounted. Beyond that, a lot of bug fixes and cleanups. In particular, a number of changes to make ext4 more robust in the face of write errors or file system corruptions" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: limit the length of per-inode prealloc list ext4: reorganize if statement of ext4_mb_release_context() ext4: add mb_debug logging when there are lost chunks ext4: Fix comment typo "the the". jbd2: clean up checksum verification in do_one_pass() ext4: change to use fallthrough macro ext4: remove unused parameter of ext4_generic_delete_entry function mballoc: replace seq_printf with seq_puts ext4: optimize the implementation of ext4_mb_good_group() ext4: delete invalid comments near ext4_mb_check_limits() ext4: fix typos in ext4_mb_regular_allocator() comment ext4: fix checking of directory entry validity for inline directories fs: prevent BUG_ON in submit_bh_wbc() ext4: correctly restore system zone info when remount fails ext4: handle add_system_zone() failure in ext4_setup_system_zone() ext4: fold ext4_data_block_valid_rcu() into the caller ext4: check journal inode extents more carefully ext4: don't allow overlapping system zones ext4: handle error of ext4_setup_system_zone() on remount ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp() ...
2020-08-07fs: prevent BUG_ON in submit_bh_wbc()Xianting Tian1-0/+9
If a device is hot-removed --- for example, when a physical device is unplugged from pcie slot or a nbd device's network is shutdown --- this can result in a BUG_ON() crash in submit_bh_wbc(). This is because the when the block device dies, the buffer heads will have their Buffer_Mapped flag get cleared, leading to the crash in submit_bh_wbc. We had attempted to work around this problem in commit a17712c8 ("ext4: check superblock mapped prior to committing"). Unfortunately, it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the device dies between when the work-around check in ext4_commit_super() and when submit_bh_wbh() is finally called: Code path: ext4_commit_super judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8 lock_buffer(sbh) ... unlock_buffer(sbh) __sync_dirty_buffer(sbh,... lock_buffer(sbh) judge if 'buffer_mapped(sbh))' is false, return <== added by this patch submit_bh(...,sbh) submit_bh_wbc(...,sbh,...) [100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc() [100722.966503] invalid opcode: 0000 [#1] SMP [100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000 [100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190 [100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246 [100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000 [100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001 [100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d [100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800 [100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000 [100722.966580] FS: 00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000 [100722.966580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0 [100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [100722.966583] PKRU: 55555554 [100722.966583] Call Trace: [100722.966588] __sync_dirty_buffer+0x6e/0xd0 [100722.966614] ext4_commit_super+0x1d8/0x290 [ext4] [100722.966626] __ext4_std_error+0x78/0x100 [ext4] [100722.966635] ? __ext4_journal_get_write_access+0xca/0x120 [ext4] [100722.966646] ext4_reserve_inode_write+0x58/0xb0 [ext4] [100722.966655] ? ext4_dirty_inode+0x48/0x70 [ext4] [100722.966663] ext4_mark_inode_dirty+0x53/0x1e0 [ext4] [100722.966671] ? __ext4_journal_start_sb+0x6d/0xf0 [ext4] [100722.966679] ext4_dirty_inode+0x48/0x70 [ext4] [100722.966682] __mark_inode_dirty+0x17f/0x350 [100722.966686] generic_update_time+0x87/0xd0 [100722.966687] touch_atime+0xa9/0xd0 [100722.966690] generic_file_read_iter+0xa09/0xcd0 [100722.966694] ? page_cache_tree_insert+0xb0/0xb0 [100722.966704] ext4_file_read_iter+0x4a/0x100 [ext4] [100722.966707] ? __inode_security_revalidate+0x4f/0x60 [100722.966709] __vfs_read+0xec/0x160 [100722.966711] vfs_read+0x8c/0x130 [100722.966712] SyS_pread64+0x87/0xb0 [100722.966716] do_syscall_64+0x67/0x1b0 [100722.966719] entry_SYSCALL64_slow_path+0x25/0x25 To address this, add the check of 'buffer_mapped(bh)' to __sync_dirty_buffer(). This also has the benefit of fixing this for other file systems. With this addition, we can drop the workaround in ext4_commit_supper(). [ Commit description rewritten by tytso. ] Signed-off-by: Xianting Tian <xianting_tian@126.com> Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-03Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+0
Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...
2020-07-08ext4: add inline encryption supportEric Biggers1-3/+4
Wire up ext4 to support inline encryption via the helper functions which fs/crypto/ now provides. This includes: - Adding a mount option 'inlinecrypt' which enables inline encryption on encrypted files where it can be used. - Setting the bio_crypt_ctx on bios that will be submitted to an inline-encrypted file. Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for this part, since ext4 sometimes uses ll_rw_block() on file data. - Not adding logically discontiguous data to bios that will be submitted to an inline-encrypted file. - Not doing filesystem-layer crypto on inline-encrypted files. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-07-01fs: remove a weird comment in submit_bh_wbcChristoph Hellwig1-5/+0
All bios can get remapped if submitted to partitions. No need to comment on that. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-02fs/buffer.c: use attach/detach_page_privateGuoqing Jiang1-12/+4
Since the new pair function is introduced, we can call them to clean the code in buffer.c. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200517214718.468-5-guoqing.jiang@cloud.ionos.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02fs/buffer.c: record blockdev write errors in super_block that it backsJeff Layton1-0/+7
When syncing out a block device (a'la __sync_blockdev), any error encountered will only be recorded in the bd_inode's mapping. When the blockdev contains a filesystem however, we'd like to also record the error in the super_block that's stored there. Make mark_buffer_write_io_error also record the error in the corresponding super_block when a writeback error occurs and the block device contains a mounted superblock. Since superblocks are RCU freed, hold the rcu_read_lock to ensure that the superblock doesn't go away while we're marking it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andres Freund <andres@anarazel.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-24Merge tag 'block-5.7-2020-04-24' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: "A few fixes/changes that should go into this release: - null_blk zoned fixes (Damien) - blkdev_close() sync improvement (Douglas) - Fix regression in blk-iocost that impacted (at least) systemtap (Waiman) - Comment fix, header removal (Zhiqiang, Jianpeng)" * tag 'block-5.7-2020-04-24' of git://git.kernel.dk/linux-block: null_blk: Cleanup zoned device initialization null_blk: Fix zoned command handling block: remove unused header blk-iocost: Fix error on iocost_ioc_vrate_adj bdev: Reduce time holding bd_mutex in sync in blkdev_close() buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
2020-04-17buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.Zhiqiang Liu1-1/+1
free_more_memory func has been completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory are no longer needed. Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-15ext4: use non-movable memory for superblock readaheadRoman Gushchin1-0/+11
Since commit a8ac900b8163 ("ext4: use non-movable memory for the superblock") buffers for ext4 superblock were allocated using the sb_bread_unmovable() helper which allocated buffer heads out of non-movable memory blocks. It was necessarily to not block page migrations and do not cause cma allocation failures. However commit 85c8f176a611 ("ext4: preload block group descriptors") broke this by introducing pre-reading of the ext4 superblock. The problem is that __breadahead() is using __getblk() underneath, which allocates buffer heads out of movable memory. It resulted in page migration failures I've seen on a machine with an ext4 partition and a preallocated cma area. Fix this by introducing sb_breadahead_unmovable() and __breadahead_gfp() helpers which use non-movable memory for buffer head allocations and use them for the ext4 superblock readahead. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 85c8f176a611 ("ext4: preload block group descriptors") Signed-off-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-30Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-12/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Continued user-access cleanups in the futex code. - percpu-rwsem rewrite that uses its own waitqueue and atomic_t instead of an embedded rwsem. This addresses a couple of weaknesses, but the primary motivation was complications on the -rt kernel. - Introduce raw lock nesting detection on lockdep (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal lock differences. This too originates from -rt. - Reuse lockdep zapped chain_hlocks entries, to conserve RAM footprint on distro-ish kernels running into the "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep chain-entries pool. - Misc cleanups, smaller fixes and enhancements - see the changelog for details" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t Documentation/locking/locktypes: Minor copy editor fixes Documentation/locking/locktypes: Further clarifications and wordsmithing m68knommu: Remove mm.h include from uaccess_no.h x86: get rid of user_atomic_cmpxchg_inatomic() generic arch_futex_atomic_op_inuser() doesn't need access_ok() x86: don't reload after cmpxchg in unsafe_atomic_op2() loop x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end() objtool: whitelist __sanitizer_cov_trace_switch() [parisc, s390, sparc64] no need for access_ok() in futex handling sh: no need of access_ok() in arch_futex_atomic_op_inuser() futex: arch_futex_atomic_op_inuser() calling conventions change completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() lockdep: Add posixtimer context tracing bits lockdep: Annotate irq_work lockdep: Add hrtimer context tracing bits lockdep: Introduce wait-type checks completion: Use simple wait queues sched/swait: Prepare usage in completions ...
2020-03-28fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_tThomas Gleixner1-12/+7
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionaly at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into a existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de
2020-03-25block: move guard_bio_eod to bio.cChristoph Hellwig1-43/+0
This is bio layer functionality and not related to buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>