aboutsummaryrefslogtreecommitdiffstats
path: root/mm/percpu.c
AgeCommit message (Collapse)AuthorFilesLines
2025-09-08percpu: fix race on alloc failed warning limitVlad Dumitrescu1-8/+12
The 'allocation failed, ...' warning messages can cause unlimited log spam, contrary to the implementation's intent. The warn_limit variable is accessed without synchronization. If more than <warn_limit> threads enter the warning path at the same time, the variable will get decremented past 0. Once it becomes negative, the non-zero check will always return true leading to unlimited log spam. Use atomic operation to access warn_limit and change condition to test for non-negative (>= 0) - atomic_dec_if_positive will return -1 once warn_limit becomes 0. Continue to print disable message alongside the last warning. While the change cited in Fixes is only adjacent, the warning limit implementation was correct before it. Only non-atomic allocations were considered for warnings, and those happened to hold pcpu_alloc_mutex while accessing warn_limit. [vdumitrescu@nvidia.com: prevent warn_limit from going negative, per Christoph Lameter] Link: https://lkml.kernel.org/r/ee87cc59-2717-4dbb-8052-1d2692c5aaaa@nvidia.com Link: https://lkml.kernel.org/r/ab22061a-a62f-4429-945b-744e5cc4ba35@nvidia.com Fixes: f7d77dfc91f7 ("mm/percpu.c: print error message too if atomic alloc failed") Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-27mm: introduce and use {pgd,p4d}_populate_kernel()Harry Yoo1-3/+3
Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables. Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes"). However, this approach has proven fragile for following reasons: 1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area. 2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")) caused crashes by accessing the vmemmap area before calling sync_global_pgds(). To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code. They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings(). This change currently targets only x86_64, so only PGD and P4D level helpers are introduced. Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. [harry.yoo@oracle.com: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/20250822020727.202749-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-13mm/percpu: prevent concurrency problem for pcpu_nr_populated read with spin lockJeongjun Park1-1/+1
pcpu_nr_pages() reads pcpu_nr_populated without any protection, which causes a data race between read/write. However, since this is an intended race, we should add a data_race annotation instead of add a spin lock. [akpm@linux-foundation.org: move pcpu_nr_units multiplication outside data_race(), per Vlastimil] Link: https://lkml.kernel.org/r/20250703065600.132221-1-aha310510@gmail.com Fixes: 7e8a6304d541 ("/proc/meminfo: add percpu populated pages count") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reported-by: syzbot+e5bd32b79413e86f389e@syzkaller.appspotmail.com Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds1-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-16mm, percpu: do not consider sleepable allocations atomicMichal Hocko1-1/+7
28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") has fixed a reclaim recursion for scoped GFP_NOFS context. It has done that by avoiding taking pcpu_alloc_mutex. This is a correct solution as the worker context with full GFP_KERNEL allocation/reclaim power and which is using the same lock cannot block the NOFS pcpu_alloc caller. On the other hand this is a very conservative approach that could lead to failures because pcpu_alloc lockless implementation is quite limited. We have a bug report about premature failures when scsi array of 193 devices is scanned. Sometimes (not consistently) the scanning aborts because the iscsid daemon fails to create the queue for a random scsi device during the scan. iscsid itslef is running with PR_SET_IO_FLUSHER set so all allocations from this process context are GFP_NOIO. This in turn makes any pcpu_alloc lockless (without pcpu_alloc_mutex) which leads to pre-mature failures. It has turned out that iscsid has worked around this by dropping PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382) when scanning host. But we can do better in this case on the kernel side and use pcpu_alloc_mutex for NOIO resp. NOFS constrained allocation scopes too. We just need the WQ worker to never trigger IO/FS reclaim. Achieve that by enforcing scoped GFP_NOIO for the whole execution of pcpu_balance_workfn (this will imply NOFS constrain as well). This will remove the dependency chain and preserve the full allocation power of the pcpu_alloc call. While at it make is_atomic really test for blockable allocations. Link: https://lkml.kernel.org/r/20250206122633.167896-1-mhocko@kernel.org Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dennis Zhou <dennis@kernel.org> Cc: Filipe David Manana <fdmanana@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-18percpu: Remove __per_cpu_loadBrian Gerst1-2/+2
__per_cpu_load is now always equal to __per_cpu_start. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250123190747.745588-15-brgerst@gmail.com
2025-01-25mm/memblock: add memblock_alloc_or_panic interfaceGuo Weikang1-56/+14
Before SLUB initialization, various subsystems used memblock_alloc to allocate memory. In most cases, when memory allocation fails, an immediate panic is required. To simplify this behavior and reduce repetitive checks, introduce `memblock_alloc_or_panic`. This function ensures that memory allocation failures result in a panic automatically, improving code readability and consistency across subsystems that require this behavior. [guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic] Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/ Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: use page->private instead of page->index in percpuMatthew Wilcox (Oracle)1-2/+2
The percpu allocator only uses one field in struct page, just change it from page->index to page->private. Link: https://lkml.kernel.org/r/20241005200121.3231142-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06percpu: fix data race with pcpu_nr_empty_pop_pagesDennis Zhou1-3/+4
Fixes the data race by moving the read to be behind the pcpu_lock. This is okay because the code (initially) above it will not increase the empty populated page count because it is populating backing pages that already have allocations served out of them. Link: https://lkml.kernel.org/r/20241008001942.8114-1-dennis@kernel.org Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407191651.f24e499d-oliver.sang@intel.com Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01percpu: remove pcpu_alloc_size()Jianhui Zhou1-31/+0
pcpu_alloc_size() was added in 7ac5c53e0073 "mm/percpu.c: introduce pcpu_alloc_size()", which is used to get the allocated memory size in bpf. However, pcpu_alloc_size() is no longer used in "bpf: Use c->unit_size to select target cache during free" because its actuall allocated memory size may change at runtime due to its slab merging mechanism. Therefore, pcpu_alloc_size() can be removed. Link: https://lkml.kernel.org/r/tencent_AD5C50E8D78C07A3CE539BD5F6BF39706507@qq.com Signed-off-by: Jianhui Zhou <912460177@qq.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: JonasZhou <JonasZhou@zhaoxin.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm: remove CONFIG_MEMCG_KMEMJohannes Weiner1-3/+3
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab tracking is enabled. It has been default-enabled and equivalent to CONFIG_MEMCG for almost a decade. We've only grown more kernel memory accounting sites since, and there is no imaginable cgroup usecase going forward that wants to track user pages but not the multitude of user-drivable kernel allocations. Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: percpu: enable per-cpu allocation taggingSuren Baghdasaryan1-56/+6
Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu to record allocations and deallocations done by these functions. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-6-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-30-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: percpu: add codetag reference into pcpuobj_extKent Overstreet1-0/+26
To store codetag for every per-cpu allocation, a codetag reference is embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks to use the newly introduced codetag are added. Link: https://lkml.kernel.org/r/20240321163705.3067592-29-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: percpu: introduce pcpuobj_extKent Overstreet1-15/+15
Upcoming alloc tagging patches require a place to stash per-allocation metadata. We already do this when memcg is enabled, so this patch generalizes the obj_cgroup * vector in struct pcpu_chunk by creating a pcpu_obj_ext type, which we will be adding to in an upcoming patch - similarly to the previous slabobj_ext patch. Link: https://lkml.kernel.org/r/20240321163705.3067592-28-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: linux-mm@kvack.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-14mm: Introduce flush_cache_vmap_early()Alexandre Ghiti1-7/+1
The pcpu setup when using the page allocator sets up a new vmalloc mapping very early in the boot process, so early that it cannot use the flush_cache_vmap() function which may depend on structures not yet initialized (for example in riscv, we currently send an IPI to flush other cpus TLB). But on some architectures, we must call flush_cache_vmap(): for example, in riscv, some uarchs can cache invalid TLB entries so we need to flush the new established mapping to avoid taking an exception. So fix this by introducing a new function flush_cache_vmap_early() which is called right after setting the new page table entry and before accessing this new mapping. This new function implements a local flush tlb on riscv and is no-op for other architectures (same as today). Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-25percpu: scoped objcg protectionRoman Gushchin1-5/+3
Similar to slab and kmem, switch to a scope-based protection of the objcg pointer to avoid. Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-20mm/percpu.c: introduce pcpu_alloc_size()Hou Tao1-0/+31
Introduce pcpu_alloc_size() to get the size of the dynamic per-cpu area. It will be used by bpf memory allocator in the following patches. BPF memory allocator maintains per-cpu area caches for multiple area sizes and its free API only has the to-be-freed per-cpu pointer, so it needs the size of dynamic per-cpu area to select the corresponding cache when bpf program frees the dynamic per-cpu pointer. Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20mm/percpu.c: don't acquire pcpu_lock for pcpu_chunk_addr_search()Hou Tao1-3/+1
There is no need to acquire pcpu_lock for pcpu_chunk_addr_search(): 1) both pcpu_first_chunk & pcpu_reserved_chunk must have been initialized before the invocation of free_percpu(). 2) The dynamically-created chunk must be valid before the per-cpu pointers allocated from it are freed. So acquire pcpu_lock() after the invocation of pcpu_chunk_addr_search(). Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-25mm/percpu.c: print error message too if atomic alloc failedBaoquan He1-2/+4
The variable 'err' is assgigned to an error message if atomic alloc failed, while it has no chance to be printed if is_atomic is true. Here change to print error message too if atomic alloc failed, while avoid to call dump_stack() if that case. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-08-25mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bitBaoquan He1-23/+15
This removes the need of local varibale 'chunk', and optimize the code calling pcpu_alloc_first_chunk() to initialize reserved chunk and dynamic chunk to make it simpler. Signed-off-by: Baoquan He <bhe@redhat.com> [Dennis: reworded first chunk init comment] Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-08-25mm/percpu.c: remove redundant checkBaoquan He1-1/+0
The conditional check "(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE) has covered the check '(!ai->dyn_size)'. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-08-25mm/percpu: Remove some local variables in pcpu_populate_pteBibo Mao1-15/+9
In function pcpu_populate_pte there are already variable defined, it can be reused for later use, here remove duplicated local variables. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2023-02-16mm: memcontrol: rename memcg_kmem_enabled()Roman Gushchin1-1/+1
Currently there are two kmem-related helper functions with a confusing semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled(). The problem is that an obvious expectation memcg_kmem_enabled() == !mem_cgroup_kmem_disabled(), can be false. mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is disabled using a boot time kernel option "cgroup.memory=nokmem". It never changes the value dynamically. memcg_kmem_enabled() is different: it always returns false until the first non-root memory cgroup will get online (assuming the kernel memory accounting is enabled). It's goal is to improve the performance on systems without the cgroupfs mounted/memory controller enabled or on the systems with only the root memory cgroup. To make things more obvious and avoid potential bugs, let's rename memcg_kmem_enabled() to memcg_kmem_online(). Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-07mm/percpu.c: remove the lcm code since block size is fixed at page sizeBaoquan He1-10/+2
Since commit b239f7daf553 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE"), the PCPU_BITMAP_BLOCK_SIZE has been set to page size fixedly. So the lcm code in pcpu_alloc_first_chunk() doesn't make sense any more, clean it up. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-11-07mm/percpu: replace the goto with breakBaoquan He1-2/+1
In function pcpu_reclaim_populated(), the line of goto jumping is unnecessary since the label 'end_chunk' is near the end of the for loop, use break instead. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-11-07mm/percpu: add comment to state the empty populated pages accountingBaoquan He1-3/+11
When allocating an area from a chunk, pcpu_block_update_hint_alloc() is called to update chunk metadata, including chunk's and global nr_empty_pop_pages. However, if the allocation is not atomic, some blocks may not be populated with pages yet, while we still subtract the number here. The number of pages will be added back with pcpu_chunk_populated() when populating pages. Adding code comment to make that more understandable. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-11-07mm/percpu: Update the code comment when creating new chunkBaoquan He1-5/+1
The lock pcpu_alloc_mutex taking code has been moved to the beginning of pcpu_allo() if it's non atomic allocation. So the code comment above above pcpu_create_chunk() callsite need be updated. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-11-07mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated()Baoquan He1-3/+3
To replace list_empty()/list_first_entry() pair to simplify code. Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-11-07mm/percpu: remove unused pcpu_map_extend_chunksBaoquan He1-3/+0
Since commit 40064aeca35c ("percpu: replace area map allocator with bitmap"), it is unneeded. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2022-07-17mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()Patrick Wang1-3/+3
Kmemleak recently added a rbtree to store the objects allocted with physical address. Those objects can't be freed with kmemleak_free(). According to the comments, percpu allocations are tracked by kmemleak separately. Kmemleak_free() was used to avoid the unnecessary tracking. If kmemleak_free() fails, those objects would be scanned by kmemleak, which is unnecessary but shouldn't lead to other effects. Use kmemleak_ignore_phys() instead of kmemleak_free() for those objects. Link: https://lkml.kernel.org/r/20220705113158.127600-1-patrick.wang.shcn@gmail.com Fixes: 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13percpu: improve percpu_alloc_percpu event traceVasily Averin1-2/+3
Add call_site, bytes_alloc and gfp_flags fields to the output of the percpu_alloc_percpu ftrace event: mkdir-4393 [001] 169.334788: percpu_alloc_percpu: call_site=mem_cgroup_css_alloc+0xa6 reserved=0 is_atomic=0 size=2408 align=8 base_addr=0xffffc7117fc00000 off=402176 ptr=0x3dc867a62300 bytes_alloc=14448 gfp_flags=GFP_KERNEL_ACCOUNT This is required to track memcg-accounted percpu allocations. Link: https://lkml.kernel.org/r/a07be858-c8a3-7851-9086-e3262cbcf707@openvz.org Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-01-23Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linuxLinus Torvalds1-19/+16
Pull bitmap updates from Yury Norov: - introduce for_each_set_bitrange() - use find_first_*_bit() instead of find_next_*_bit() where possible - unify for_each_bit() macros * tag 'bitmap-5.17-rc1' of git://github.com/norov/linux: vsprintf: rework bitmap_list_string lib: bitmap: add performance test for bitmap_print_to_pagebuf bitmap: unify find_bit operations mm/percpu: micro-optimize pcpu_is_populated() Replace for_each_*_bit_from() with for_each_*_bit() where appropriate find: micro-optimize for_each_{set,clear}_bit() include/linux: move for_each_bit() macros from bitops.h to find.h cpumask: replace cpumask_next_* with cpumask_first_* where appropriate tools: sync tools/bitmap with mother linux all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate cpumask: use find_first_and_bit() lib: add find_first_and_bit() arch: remove GENERIC_FIND_FIRST_BIT entirely include: move find.h from asm_generic to linux bitops: move find_bit_*_le functions from le.h to find.h bitops: protect find_first_{,zero}_bit properly
2022-01-20Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-34/+120
Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ...
2022-01-20mm: percpu: add generic pcpu_populate_pte() functionKefeng Wang1-5/+71
With NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, we need a function to populate pte, this patch adds a generic pcpu populate pte function, pcpu_populate_pte(), which is marked __weak and used on most architectures, but it is overridden on x86, which has its own implementation. Link: https://lkml.kernel.org/r/20211216112359.103822-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20mm: percpu: add generic pcpu_fc_alloc/free funcitonKefeng Wang1-31/+47
With the previous patch, we could add a generic pcpu first chunk allocate and free function to cleanup the duplicated definations on each architecture. Link: https://lkml.kernel.org/r/20211216112359.103822-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-20mm: percpu: add pcpu_fc_cpu_to_node_fn_t typedefKefeng Wang1-5/+9
Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t, pcpu first chunk allocation will call it to alloc memblock on the corresponding node by it, this is prepare for the next patch. Link: https://lkml.kernel.org/r/20211216112359.103822-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15bitmap: unify find_bit operationsYury Norov1-12/+8
bitmap_for_each_{set,clear}_region() are similar to for_each_bit() macros in include/linux/find.h, but interface and implementation of them are different. This patch adds for_each_bitrange() macros and drops unused bitmap_*_region() API in sake of unification. Signed-off-by: Yury Norov <yury.norov@gmail.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
2022-01-15mm/percpu: micro-optimize pcpu_is_populated()Yury Norov1-7/+8
bitmap_next_clear_region() calls find_next_zero_bit() and find_next_bit() sequentially to find a range of clear bits. In case of pcpu_is_populated() there's a chance to return earlier if bitmap has all bits set. Signed-off-by: Yury Norov <yury.norov@gmail.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Dennis Zhou <dennis@kernel.org>
2022-01-15mm: memcg/percpu: account extra objcg space to memory cgroupsQi Zheng1-5/+5
Similar to slab memory allocator, for each accounted percpu object there is an extra space which is used to store obj_cgroup membership. Charge it too. [akpm@linux-foundation.org: fix layout] Link: https://lkml.kernel.org/r/20211126040606.97836-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: use memblock_free for freeing virtual pointersMike Rapoport1-4/+4
Rename memblock_free_ptr() to memblock_free() and use memblock_free() when freeing a virtual pointer so that memblock_free() will be a counterpart of memblock_alloc() The callers are updated with the below semantic patch and manual addition of (void *) casting to pointers that are represented by unsigned long variables. @@ identifier vaddr; expression size; @@ ( - memblock_phys_free(__pa(vaddr), size); + memblock_free(vaddr, size); | - memblock_free_ptr(vaddr, size); + memblock_free(vaddr, size); ) [sfr@canb.auug.org.au: fixup] Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: rename memblock_free to memblock_phys_freeMike Rapoport1-4/+4
Since memblock_free() operates on a physical range, make its name reflect it and rename it to memblock_phys_free(), so it will be a logical counterpart to memblock_phys_alloc(). The callers are updated with the below semantic patch: @@ expression addr; expression size; @@ - memblock_free(addr, size); + memblock_phys_free(addr, size); Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06memblock: drop memblock_free_early_nid() and memblock_free_early()Mike Rapoport1-4/+4
memblock_free_early_nid() is unused and memblock_free_early() is an alias for memblock_free(). Replace calls to memblock_free_early() with calls to memblock_free() and remove memblock_free_early() and memblock_free_early_nid(). Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+0
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08percpu: remove export of pcpu_base_addrGreg Kroah-Hartman1-1/+0
This is not needed by any modules, so remove the export. Link: https://lkml.kernel.org/r/20210722185814.504541-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()Jing Xiangfeng1-3/+0
Commit b239f7daf553 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE") removed the parameter 'for_alloc', so remove this comment. Link: https://lkml.kernel.org/r/1630576043-21367-1-git-send-email-jingxiangfeng@huawei.com Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-04percpu: flush tlb in pcpu_reclaim_populated()Dennis Zhou1-6/+26
Prior to "percpu: implement partial chunk depopulation", pcpu_depopulate_chunk() was called only on the destruction path. This meant the virtual address range was on its way back to vmalloc which will handle flushing the tlbs for us. However, with pcpu_reclaim_populated(), we are now calling pcpu_depopulate_chunk() during the active lifecycle of a chunk. Therefore, we need to flush the tlb as well otherwise we can end up accessing the wrong page through an invalid tlb mapping as reported in [1]. [1] https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/ Fixes: f183324133ea ("percpu: implement partial chunk depopulation") Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-07-01Merge branch 'for-5.14' of ↵Linus Torvalds1-97/+271
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu updates from Dennis Zhou: - percpu chunk depopulation - depopulate backing pages for chunks with empty pages when we exceed a global threshold without those pages. This lets us reclaim a portion of memory that would previously be lost until the full chunk would be freed (possibly never). - memcg accounting cleanup - previously separate chunks were managed for normal allocations and __GFP_ACCOUNT allocations. These are now consolidated which cleans up the code quite a bit. - a few misc clean ups for clang warnings * 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: optimize locking in pcpu_balance_workfn() percpu: initialize best_upa variable percpu: rework memcg accounting mm, memcg: introduce mem_cgroup_kmem_disabled() mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init percpu: make symbol 'pcpu_free_slot' static percpu: implement partial chunk depopulation percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 percpu: factor out pcpu_check_block_hint() percpu: split __pcpu_balance_workfn() percpu: fix a comment about the chunks ordering
2021-06-17percpu: optimize locking in pcpu_balance_workfn()Roman Gushchin1-12/+29
pcpu_balance_workfn() unconditionally calls pcpu_balance_free(), pcpu_reclaim_populated(), pcpu_balance_populated() and pcpu_balance_free() again. Each call to pcpu_balance_free() and pcpu_reclaim_populated() will cause at least one acquisition of the pcpu_lock. So even if the balancing was scheduled because of a failed atomic allocation, pcpu_lock will be acquired at least 4 times. This obviously increases the contention on the pcpu_lock. To optimize the scheme let's grab the pcpu_lock on the upper level (in pcpu_balance_workfn()) and keep it generally locked for the whole duration of the scheduled work, but release conditionally to perform any slow operations like chunk (de)population and creation of new chunks. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-06-14percpu: initialize best_upa variableDennis Zhou1-0/+2
Tom reported this finding from clang 10's static analysis [1]. Due to the way the code is written, it will always see a successful loop iteration. Instead of setting an initial value, check that it was set instead with BUG_ON() because 0 units per allocation is bogus. [1] https://lore.kernel.org/lkml/20210515180817.1751084-1-trix@redhat.com/ Reported-by: Tom Rix <trix@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-06-05percpu: rework memcg accountingRoman Gushchin1-81/+53
The current implementation of the memcg accounting of the percpu memory is based on the idea of having two separate sets of chunks for accounted and non-accounted memory. This approach has an advantage of not wasting any extra memory for memcg data for non-accounted chunks, however it complicates the code and leads to a higher chunks number due to a lower chunk utilization. Instead of having two chunk types it's possible to declare all* chunks memcg-aware unless the kernel memory accounting is disabled globally by a boot option. The size of objcg_array is usually small in comparison to chunks themselves (it obviously depends on the number of CPUs), so even if some chunk will have no accounted allocations, the memory waste isn't significant and will likely be compensated by a higher chunk utilization. Also, with time more and more percpu allocations will likely become accounted. * The first chunk is initialized before the memory cgroup subsystem, so we don't know for sure whether we need to allocate obj_cgroups. Because it's small, let's make it free for use. Then we don't need to allocate obj_cgroups for it. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-05-14percpu: make symbol 'pcpu_free_slot' staticWei Yongjun1-1/+1
The sparse tool complains as follows: mm/percpu.c:138:5: warning: symbol 'pcpu_free_slot' was not declared. Should it be static? This symbol is not used outside of percpu.c, so marks it static. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-05-07mm: fix typos in commentsIngo Molnar1-1/+1
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-21percpu: implement partial chunk depopulationRoman Gushchin1-15/+165
From Roman ("percpu: partial chunk depopulation"): In our [Facebook] production experience the percpu memory allocator is sometimes struggling with returning the memory to the system. A typical example is a creation of several thousands memory cgroups (each has several chunks of the percpu data used for vmstats, vmevents, ref counters etc). Deletion and complete releasing of these cgroups doesn't always lead to a shrinkage of the percpu memory, so that sometimes there are several GB's of memory wasted. The underlying problem is the fragmentation: to release an underlying chunk all percpu allocations should be released first. The percpu allocator tends to top up chunks to improve the utilization. It means new small-ish allocations (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, effectively pinning them in memory. This patchset solves this problem by implementing a partial depopulation of percpu chunks: chunks with many empty pages are being asynchronously depopulated and the pages are returned to the system. To illustrate the problem the following script can be used: -- cd /sys/fs/cgroup mkdir percpu_test echo "+memory" > percpu_test/cgroup.subtree_control cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do mkdir percpu_test/cg_"${i}" for j in `seq 1 10`; do mkdir percpu_test/cg_"${i}"_"${j}" done done cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do for j in `seq 1 10`; do rmdir percpu_test/cg_"${i}"_"${j}" done done sleep 10 cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do rmdir percpu_test/cg_"${i}" done rmdir percpu_test -- It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups and the size after deleting most of them. Results: vanilla: ./percpu_test.sh Percpu: 7488 kB Percpu: 481152 kB Percpu: 481152 kB with this patchset applied: ./percpu_test.sh Percpu: 7488 kB Percpu: 481408 kB Percpu: 135552 kB The total size of the percpu memory was reduced by more than 3.5 times. This patch: This patch implements partial depopulation of percpu chunks. As of now, a chunk can be depopulated only as a part of the final destruction, if there are no more outstanding allocations. However to minimize a memory waste it might be useful to depopulate a partially filed chunk, if a small number of outstanding allocations prevents the chunk from being fully reclaimed. This patch implements the following depopulation process: it scans over the chunk pages, looks for a range of empty and populated pages and performs the depopulation. To avoid races with new allocations, the chunk is previously isolated. After the depopulation the chunk is sidelined to a special list or freed. New allocations prefer using active chunks to sidelined chunks. If a sidelined chunk is used, it is reintegrated to the active lists. The depopulation is scheduled on the free path if the chunk is all of the following: 1) has more than 1/4 of total pages free and populated 2) the system has enough free percpu pages aside of this chunk 3) isn't the reserved chunk 4) isn't the first chunk If it's already depopulated but got free populated pages, it's a good target too. The chunk is moved to a special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work item is scheduled. On isolation, these pages are removed from the pcpu_nr_empty_pop_pages. It is constantly replaced to the to_depopulate_slot when it meets these qualifications. pcpu_reclaim_populated() iterates over the to_depopulate_slot until it becomes empty. The depopulation is performed in the reverse direction to keep populated pages close to the beginning. Depopulated chunks are sidelined to preferentially avoid them for new allocations. When no active chunk can suffice a new allocation, sidelined chunks are first checked before creating a new chunk. Signed-off-by: Roman Gushchin <guro@fb.com> Co-developed-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Tested-by: Pratik Sampat <psampat@linux.ibm.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1Dennis Zhou1-6/+8
This prepares for adding a to_depopulate list and sidelined list after the free slot in the set of lists in pcpu_slot. Signed-off-by: Dennis Zhou <dennis@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21percpu: factor out pcpu_check_block_hint()Roman Gushchin1-7/+23
Factor out the pcpu_check_block_hint() helper, which will be useful in the future. The new function checks if the allocation can likely fit within the contig hint. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-16percpu: split __pcpu_balance_workfn()Roman Gushchin1-17/+29
__pcpu_balance_workfn() became fairly big and hard to follow, but in fact it consists of two fully independent parts, responsible for the destruction of excessive free chunks and population of necessarily amount of free pages. In order to simplify the code and prepare for adding of a new functionality, split it in two functions: 1) pcpu_balance_free, 2) pcpu_balance_populated. Move the taking/releasing of the pcpu_alloc_mutex to an upper level to keep the current synchronization in place. Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-16percpu: fix a comment about the chunks orderingRoman Gushchin1-1/+4
Since the commit 3e54097beb22 ("percpu: manage chunks based on contig_bits instead of free_bytes") chunks are sorted based on the size of the biggest continuous free area instead of the total number of free bytes. Update the corresponding comment to reflect this. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-09percpu: make pcpu_nr_empty_pop_pages per chunk typeRoman Gushchin1-7/+7
nr_empty_pop_pages is used to guarantee that there are some free populated pages to satisfy atomic allocations. Accounted and non-accounted allocations are using separate sets of chunks, so both need to have a surplus of empty pages. This commit makes pcpu_nr_empty_pop_pages and the corresponding logic per chunk type. [Dennis] This issue came up as I was reviewing [1] and realized I missed this. Simultaneously, it was reported btrfs was seeing failed atomic allocations in fsstress tests [2] and [3]. [1] https://lore.kernel.org/linux-mm/20210324190626.564297-1-guro@fb.com/ [2] https://lore.kernel.org/linux-mm/20210401185158.3275.409509F4@e16-tech.com/ [3] https://lore.kernel.org/linux-mm/CAL3q7H5RNBjCi708GH7jnczAOe0BLnacT9C+OBgA-Dx9jhB6SQ@mail.gmail.com/ Fixes: 3c7be18ac9a0 ("mm: memcg/percpu: account percpu memory to memory cgroups") Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Roman Gushchin <guro@fb.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-02-14percpu: fix clang modpost section mismatchDennis Zhou1-1/+1
pcpu_build_alloc_info() is an __init function that makes a call to cpumask_clear_cpu(). With CONFIG_GCOV_PROFILE_ALL enabled, the inline heuristics are modified and such cpumask_clear_cpu() which is marked inline doesn't get inlined. Because it works on mask in __initdata, modpost throws a section mismatch error. Arnd sent a patch with the flatten attribute as an alternative [2]. I've added it to compiler_attributes.h. modpost complaint: WARNING: modpost: vmlinux.o(.text+0x735425): Section mismatch in reference from the function cpumask_clear_cpu() to the variable .init.data:pcpu_build_alloc_info.mask The function cpumask_clear_cpu() references the variable __initdata pcpu_build_alloc_info.mask. This is often because cpumask_clear_cpu lacks a __initdata annotation or the annotation of pcpu_build_alloc_info.mask is wrong. clang output: mm/percpu.c:2724:5: remark: cpumask_clear_cpu not inlined into pcpu_build_alloc_info because too costly to inline (cost=725, threshold=325) [-Rpass-missed=inline] [1] https://lore.kernel.org/linux-mm/202012220454.9F6Bkz9q-lkp@intel.com/ [2] https://lore.kernel.org/lkml/CAK8P3a2ZWfNeXKSm8K_SUhhwkor17jFo3xApLXjzfPqX0eUDUA@mail.gmail.com/ Reported-by: kernel test robot <lkp@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-02-14percpu: reduce the number of cpu distance comparisonsWonhyuk Yang1-14/+20
To build group_map[] and group_cnt[], we find out which group CPUs belong to by comparing the distance of the cpu. However, this includes cases where comparisons are not required. This patch uses a bitmap to record CPUs that is not classified in the group. CPUs that we know which group they belong to should be cleared from the bitmap. In result, we can reduce the number of unnecessary comparisons. Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> [Dennis: added cpumask_clear() call and #include cpumask.h.]
2020-10-30percpu: convert flexible array initializers to use struct_size()Dennis Zhou1-4/+4
Use the safer macro as sparked by the long discussion in [1]. [1] https://lore.kernel.org/lkml/20200917204514.GA2880159@google.com/ Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-10-18mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()Roman Gushchin1-2/+1
Patch series "mm: kmem: kernel memory accounting in an interrupt context". This patchset implements memcg-based memory accounting of allocations made from an interrupt context. Historically, such allocations were passed unaccounted mostly because charging the memory cgroup of the current process wasn't an option. Also performance reasons were likely a reason too. The remote charging API allows to temporarily overwrite the currently active memory cgroup, so that all memory allocations are accounted towards some specified memory cgroup instead of the memory cgroup of the current process. This patchset extends the remote charging API so that it can be used from an interrupt context. Then it removes the fence that prevented the accounting of allocations made from an interrupt context. It also contains a couple of optimizations/code refactorings. This patchset doesn't directly enable accounting for any specific allocations, but prepares the code base for it. The bpf memory accounting will likely be the first user of it: a typical example is a bpf program parsing an incoming network packet, which allocates an entry in hashmap map to store some information. This patch (of 4): Currently memcg_kmem_bypass() is called before obtaining the current memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the number of call sites and allows further code simplifications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-17percpu: fix first chunk size calculation for populated bitmapSunghyun Jin1-1/+1
Variable populated, which is a member of struct pcpu_chunk, is used as a unit of size of unsigned long. However, size of populated is miscounted. So, I fix this minor part. Fixes: 8ab16c43ea79 ("percpu: change the number of pages marked in the first_chunk pop bitmap") Cc: <stable@vger.kernel.org> # 4.14+ Signed-off-by: Sunghyun Jin <mcsmonk@gmail.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-08-12mm: memcg/percpu: per-memcg percpu memory statisticsRoman Gushchin1-0/+10
Percpu memory can represent a noticeable chunk of the total memory consumption, especially on big machines with many CPUs. Let's track percpu memory usage for each memcg and display it in memory.stat. A percpu allocation is usually scattered over multiple pages (and nodes), and can be significantly smaller than a page. So let's add a byte-sized counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs can be perfectly reused for percpu case. [guro@fb.com: v3] Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: memcg/percpu: account percpu memory to memory cgroupsRoman Gushchin1-21/+164
Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the perpcu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use obj_cgroup API, introduced for slab accounting. [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning] [akpm@linux-foundation.org: move unreachable code, per Roman] [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12percpu: return number of released bytes from pcpu_free_area()Roman Gushchin1-3/+10
Patch series "mm: memcg accounting of percpu memory", v3. This patchset adds percpu memory accounting to memory cgroups. It's based on the rework of the slab controller and reuses concepts and features introduced for the per-object slab accounting. Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. Percpu allocations by their nature are scattered over multiple pages, so they can't be tracked on the per-page basis. So the per-object tracking introduced by the new slab controller is reused. The patchset implements charging of percpu allocations, adds memcg-level statistics, enables accounting for percpu allocations made by memory cgroup internals and provides some basic tests. To implement the accounting of percpu memory without a significant memory and performance overhead the following approach is used: all accounted allocations are placed into a separate percpu chunk (or chunks). These chunks are similar to default chunks, except that they do have an attached vector of pointers to obj_cgroup objects, which is big enough to save a pointer for each allocated object. On the allocation, if the allocation has to be accounted (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root memory cgroup, etc), the memory cgroup is getting charged and if the maximum limit is not exceeded the allocation is performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is forced over the limit, depending on gfp (as any other kernel memory allocation). The memory cgroup information is saved in the obj_cgroup vector at the corresponding offset. On the release time the memcg information is restored from the vector and the cgroup is getting uncharged. Unaccounted allocations (at this point the absolute majority of all percpu allocations) are performed in the old way, so no additional overhead is expected. To avoid pinning dying memory cgroups by outstanding allocations, obj_cgroup API is used instead of directly saving memory cgroup pointers. obj_cgroup is basically a pointer to a memory cgroup with a standalone reference counter. The trick is that it can be atomically swapped to point at the parent cgroup, so that the original memory cgroup can be released prior to all objects, which has been charged to it. Because all charges and statistics are fully recursive, it's perfectly correct to uncharge the parent cgroup instead. This scheme is used in the slab memory accounting, and percpu memory can just follow the scheme. This patch (of 5): To implement accounting of percpu memory we need the information about the size of freed object. Return it from pcpu_free_area(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> cC: Michal Koutnýutny@suse.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-1/+1
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-02mm: remove the pgprot argument to __vmallocChristoph Hellwig1-1/+1
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv] Acked-by: Gao Xiang <xiang@kernel.org> [erofs] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Wei Liu <wei.liu@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07percpu: make pcpu_alloc() aware of current gfp contextFilipe Manana1-4/+10
Since 5.7-rc1, on btrfs we have a percpu counter initialization for which we always pass a GFP_KERNEL gfp_t argument (this happens since commit 2992df73268f78 ("btrfs: Implement DREW lock")). That is safe in some contextes but not on others where allowing fs reclaim could lead to a deadlock because we are either holding some btrfs lock needed for a transaction commit or holding a btrfs transaction handle open. Because of that we surround the call to the function that initializes the percpu counter with a NOFS context using memalloc_nofs_save() (this is done at btrfs_init_fs_root()). However it turns out that this is not enough to prevent a possible deadlock because percpu_alloc() determines if it is in an atomic context by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this case) and it is not aware that a NOFS context is set. Because percpu_alloc() thinks it is in a non atomic context it locks the pcpu_alloc_mutex. This can result in a btrfs deadlock when pcpu_balance_workfn() is running, has acquired that mutex and is waiting for reclaim, while the btrfs task that called percpu_counter_init() (and therefore percpu_alloc()) is holding either the btrfs commit_root semaphore or a transaction handle (done fs/btrfs/backref.c: iterate_extent_inodes()), which prevents reclaim from finishing as an attempt to commit the current btrfs transaction will deadlock. Lockdep reports this issue with the following trace: ====================================================== WARNING: possible circular locking dependency detected 5.6.0-rc7-btrfs-next-77 #1 Not tainted ------------------------------------------------------ kswapd0/91 is trying to acquire lock: ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] but task is already holding lock: ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (fs_reclaim){+.+.}: fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc+0x5f/0x3a0 pcpu_create_chunk+0x19/0x230 pcpu_balance_workfn+0x56a/0x680 process_one_work+0x235/0x5f0 worker_thread+0x50/0x3b0 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 -> #3 (pcpu_alloc_mutex){+.+.}: __mutex_lock+0xa9/0xaf0 pcpu_alloc+0x480/0x7c0 __percpu_counter_init+0x50/0xd0 btrfs_drew_lock_init+0x22/0x70 [btrfs] btrfs_get_fs_root+0x29c/0x5c0 [btrfs] resolve_indirect_refs+0x120/0xa30 [btrfs] find_parent_nodes+0x50b/0xf30 [btrfs] btrfs_find_all_leafs+0x60/0xb0 [btrfs] iterate_extent_inodes+0x139/0x2f0 [btrfs] iterate_inodes_from_logical+0xa1/0xe0 [btrfs] btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs] btrfs_ioctl+0x165a/0x3130 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #2 (&fs_info->commit_root_sem){++++}: down_write+0x38/0x70 btrfs_cache_block_group+0x2ec/0x500 [btrfs] find_free_extent+0xc6a/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] commit_cowonly_roots+0x55/0x310 [btrfs] btrfs_commit_transaction+0x509/0xb20 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x93/0xc0 exit_to_usermode_loop+0xf9/0x100 do_syscall_64+0x20d/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&space_info->groups_sem){++++}: down_read+0x3c/0x140 find_free_extent+0xef6/0x1600 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x5a0 [btrfs] btrfs_cow_block+0x106/0x240 [btrfs] btrfs_search_slot+0x50c/0xd60 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] __btrfs_update_delayed_inode+0x90/0x280 [btrfs] __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs] __btrfs_run_delayed_items+0x8e/0x180 [btrfs] btrfs_commit_transaction+0x31b/0xb20 [btrfs] iterate_supers+0x87/0xf0 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (&delayed_node->mutex){+.+.}: __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(pcpu_alloc_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/91: #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0 #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8f/0xd0 check_noncircular+0x170/0x190 __lock_acquire+0xef0/0x1c80 lock_acquire+0xa2/0x1d0 __mutex_lock+0xa9/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x40d/0x560 [btrfs] evict+0xd9/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x124/0x1a0 do_shrink_slab+0x176/0x440 shrink_slab+0x23a/0x2c0 shrink_node+0x188/0x6e0 balance_pgdat+0x31d/0x7f0 kswapd+0x238/0x550 kthread+0x120/0x140 ret_from_fork+0x3a/0x50 This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL to percpu_counter_init() in contextes where it is not reclaim safe, however that type of approach is discouraged since memalloc_[nofs|noio]_save() were introduced. Therefore this change makes pcpu_alloc() look up into an existing nofs/noio context before deciding whether it is in an atomic context or not. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-01percpu: update copyright emails to dennis@kernel.orgDennis Zhou1-1/+1
Currently there are 3 emails tied to me in the kernel tree, I'd rather dennis@kernel.org be the only one. Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-01-20bitmap: genericize percpu bitmap region iteratorsDennis Zhou1-45/+16
Bitmaps are fairly popular for their space efficiency, but we don't have generic iterators available. Make percpu's bitmap region iterators available to everyone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-04percpu: Use struct_size() helperGustavo A. R. Silva1-1/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct pcpu_alloc_info { ... struct pcpu_group_info groups[]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. So, replace the following form: sizeof(*ai) + nr_groups * sizeof(ai->groups[0]) with: struct_size(ai, groups, nr_groups) This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-07-23percpu: fix typo in pcpu_setup_first_chunk() commentChristophe JAILLET1-1/+1
s/perpcu/percpu/ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Dennis Zhou <dennis@kernel.org> [Dennis: updated title]
2019-07-04percpu: Make pcpu_setup_first_chunk() void functionKefeng Wang1-12/+7
pcpu_setup_first_chunk() will panic or BUG_ON if the are some error and doesn't return any error, hence it can be defined to return void. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> [Dennis: fixed kbuild warning for pcpu_page_first_chunk()]
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428Thomas Gleixner1-2/+1
Based on 1 normalized pattern(s): this file is released under the gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 68 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-13Merge branch 'for-5.2' of ↵Linus Torvalds1-162/+387
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu updates from Dennis Zhou: - scan hint update which helps address performance issues with heavily fragmented blocks - lockdep fix when freeing an allocation causes balance work to be scheduled * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: remove spurious lock dependency between percpu and sched percpu: use chunk scan_hint to skip some scanning percpu: convert chunk hints to be based on pcpu_block_md percpu: make pcpu_block_md generic percpu: use block scan_hint to only scan forward percpu: remember largest area skipped during allocation percpu: add block level scan_hint percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE percpu: relegate chunks unusable when failing small allocations percpu: manage chunks based on contig_bits instead of free_bytes percpu: introduce helper to determine if two regions overlap percpu: do not search past bitmap when allocating an area percpu: update free path with correct new free region
2019-05-08percpu: remove spurious lock dependency between percpu and schedJohn Sperbeck1-1/+5
In free_percpu() we sometimes call pcpu_schedule_balance_work() to queue a work item (which does a wakeup) while holding pcpu_lock. This creates an unnecessary lock dependency between pcpu_lock and the scheduler's pi_lock. There are other places where we call pcpu_schedule_balance_work() without hold pcpu_lock, and this case doesn't need to be different. Moving the call outside the lock prevents the following lockdep splat when running tools/testing/selftests/bpf/{test_maps,test_progs} in sequence with lockdep enabled: ====================================================== WARNING: possible circular locking dependency detected 5.1.0-dbg-DEV #1 Not tainted ------------------------------------------------------ kworker/23:255/18872 is trying to acquire lock: 000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520 but task is already holding lock: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (pcpu_lock){..-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 pcpu_alloc+0xfa/0x780 __alloc_percpu_gfp+0x12/0x20 alloc_htab_elem+0x184/0x2b0 __htab_percpu_map_update_elem+0x252/0x290 bpf_percpu_hash_update+0x7c/0x130 __do_sys_bpf+0x1912/0x1be0 __x64_sys_bpf+0x1a/0x20 do_syscall_64+0x59/0x400 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #3 (&htab->buckets[i].lock){....}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 htab_map_update_elem+0x1af/0x3a0 -> #2 (&rq->lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 task_fork_fair+0x37/0x160 sched_fork+0x211/0x310 copy_process.part.43+0x7b1/0x2160 _do_fork+0xda/0x6b0 kernel_thread+0x29/0x30 rest_init+0x22/0x260 arch_call_rest_init+0xe/0x10 start_kernel+0x4fd/0x520 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0x6f/0x72 secondary_startup_64+0xa4/0xb0 -> #1 (&p->pi_lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 try_to_wake_up+0x41/0x600 wake_up_process+0x15/0x20 create_worker+0x16b/0x1e0 workqueue_init+0x279/0x2ee kernel_init_freeable+0xf7/0x288 kernel_init+0xf/0x180 ret_from_fork+0x24/0x30 -> #0 (&(&pool->lock)->rlock){-.-.}: __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 other info that might help us debug this: Chain exists of: &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_lock); lock(&htab->buckets[i].lock); lock(pcpu_lock); lock(&(&pool->lock)->rlock); *** DEADLOCK *** 3 locks held by kworker/23:255/18872: #0: 00000000b36a6e16 ((wq_completion)events){+.+.}, at: process_one_work+0x17a/0x580 #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x17a/0x580 #2: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 stack backtrace: CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1 Hardware name: ... Workqueue: events bpf_map_free_deferred Call Trace: dump_stack+0x67/0x95 print_circular_bug.isra.38+0x1c6/0x220 check_prev_add.constprop.50+0x9f6/0xd20 __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-18percpu: stop printing kernel addressesMatteo Croce1-4/+4
Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"), at boot "____ptrval____" is printed instead of actual addresses: percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288 Instead of changing the print to "%px", and leaking kernel addresses, just remove the print completely, cfr. e.g. commit 071929dbdd865f77 ("arm64: Stop printing the virtual memory layout"). Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13percpu: use chunk scan_hint to skip some scanningDennis Zhou1-9/+27
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: convert chunk hints to be based on pcpu_block_mdDennis Zhou1-67/+53
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: make pcpu_block_md genericDennis Zhou1-7/+13
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one in the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: use block scan_hint to only scan forwardDennis Zhou1-6/+17
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13percpu: remember largest area skipped during allocationDennis Zhou1-2/+99
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes due to first fit. Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13percpu: add block level scan_hintDennis Zhou1-7/+94
Fragmentation can cause both blocks and chunks to have an early first_firee bit available, but only able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZEDennis Zhou1-66/+44
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacked off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: relegate chunks unusable when failing small allocationsDennis Zhou1-9/+26
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: manage chunks based on contig_bits instead of free_bytesDennis Zhou1-1/+1
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: introduce helper to determine if two regions overlapDennis Zhou1-4/+24
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. A future patch introduces the scan_hint which makes the contig_hint slightly fuzzy as they can at times be smaller than the actual hint. Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13percpu: do not search past bitmap when allocating an areaDennis Zhou1-1/+2
pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13percpu: update free path with correct new free regionDennis Zhou1-1/+1
When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-12memblock: drop memblock_alloc_*_nopanic() variantsMike Rapoport1-7/+4
As all the memblock allocation functions return NULL in case of error rather than panic(), the duplicates with _nopanic suffix can be removed. Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Petr Mladek <pmladek@suse.com> [printk] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12mm/percpu: add checks for the return value of memblock_alloc*()Mike Rapoport1-17/+56
Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-23percpu: use nr_groups as check conditionPeng Fan1-1/+1
group_cnt array is defined with NR_CPUS entries, but normally nr_groups will not reach up to NR_CPUS. So there is no issue to the current code. Checking other parts of pcpu_build_alloc_info, use nr_groups as check condition, so make it consistent to use 'group < nr_groups' as for loop check. In case we do have nr_groups equals with NR_CPUS, we could also avoid memory access out of bounds. Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-11-01Merge branch 'for-4.20' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu fixes from Dennis Zhou: "Two small things for v4.20. The first fixes a clang uninitialized variable warning for arm64 in the default path calls BUILD_BUG(). The second removes an unnecessary unlikely() in a WARN_ON() use" * 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: arm64: percpu: Initialize ret in the default case mm: percpu: remove unnecessary unlikely()
2018-10-31memblock: stop using implicit alignment to SMP_CACHE_BYTESMike Rapoport1-18/+20
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport1-1/+1
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31memblock: remove _virt from APIs returning virtual addressMike Rapoport1-14/+14
The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-07percpu: stop leaking bitmap metadata blocksMike Rapoport1-0/+1
The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks") introduced bitmap metadata blocks. These metadata blocks are allocated whenever a new chunk is created, but they are never freed. Fix it. Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-09-12mm: percpu: remove unnecessary unlikely()Igor Stoppa1-1/+1
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it into another. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: zijun_hu <zijun_hu@htc.com> Cc: Christoph Lameter <cl@linux.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-08-22/proc/meminfo: add percpu populated pages countDennis Zhou (Facebook)1-0/+29
Currently, percpu memory only exposes allocation and utilization information via debugfs. This more or less is only really useful for understanding the fragmentation and allocation information at a per-chunk level with a few global counters. This is also gated behind a config. BPF and cgroup, for example, have seen an increase in use causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the "Percpu" stat to meminfo to more easily look up how much percpu memory is in use. This number includes the cost for all allocated backing pages and not just insight at the per a unit, per chunk level. Metadata is excluded. I think excluding metadata is fair because the backing memory scales with the numbere of cpus and can quickly outweigh the metadata. It also makes this calculation light. Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-02Merge tag 'arch-removal' of ↵Linus Torvalds1-4/+0
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pul removal of obsolete architecture ports from Arnd Bergmann: "This removes the entire architecture code for blackfin, cris, frv, m32r, metag, mn10300, score, and tile, including the associated device drivers. I have been working with the (former) maintainers for each one to ensure that my interpretation was right and the code is definitely unused in mainline kernels. Many had fond memories of working on the respective ports to start with and getting them included in upstream, but also saw no point in keeping the port alive without any users. In the end, it seems that while the eight architectures are extremely different, they all suffered the same fate: There was one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem, which was more costly than licensing newer off-the-shelf CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems that all the SoC product lines are still around, but have not used the custom CPU architectures for several years at this point. In contrast, CPU instruction sets that remain popular and have actively maintained kernel ports tend to all be used across multiple licensees. [ See the new nds32 port merged in the previous commit for the next generation of "one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem" - Linus ] The removal came out of a discussion that is now documented at https://lwn.net/Articles/748074/. Unlike the original plans, I'm not marking any ports as deprecated but remove them all at once after I made sure that they are all unused. Some architectures (notably tile, mn10300, and blackfin) are still being shipped in products with old kernels, but those products will never be updated to newer kernel releases. After this series, we still have a few architectures without mainline gcc support: - unicore32 and hexagon both have very outdated gcc releases, but the maintainers promised to work on providing something newer. At least in case of hexagon, this will only be llvm, not gcc. - openrisc, risc-v and nds32 are still in the process of finishing their support or getting it added to mainline gcc in the first place. They all have patched gcc-7.3 ports that work to some degree, but complete upstream support won't happen before gcc-8.1. Csky posted their first kernel patch set last week, their situation will be similar [ Palmer Dabbelt points out that RISC-V support is in mainline gcc since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]" This really says it all: 2498 files changed, 95 insertions(+), 467668 deletions(-) * tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits) MAINTAINERS: UNICORE32: Change email account staging: iio: remove iio-trig-bfin-timer driver tty: hvc: remove tile driver tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers serial: remove tile uart driver serial: remove m32r_sio driver serial: remove blackfin drivers serial: remove cris/etrax uart drivers usb: Remove Blackfin references in USB support usb: isp1362: remove blackfin arch glue usb: musb: remove blackfin port usb: host: remove tilegx platform glue pwm: remove pwm-bfin driver i2c: remove bfin-twi driver spi: remove blackfin related host drivers watchdog: remove bfin_wdt driver can: remove bfin_can driver mmc: remove bfin_sdh driver input: misc: remove blackfin rotary driver input: keyboard: remove bf54x driver ...
2018-03-26treewide: simplify Kconfig dependencies for removed archsArnd Bergmann1-4/+0
A lot of Kconfig symbols have architecture specific dependencies. In those cases that depend on architectures we have already removed, they can be omitted. Acked-by: Kalle Valo <kvalo@codeaurora.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-03-19mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()Kirill Tkhai1-2/+11
In case of memory deficit and low percpu memory pages, pcpu_balance_workfn() takes pcpu_alloc_mutex for a long time (as it makes memory allocations itself and waits for memory reclaim). If tasks doing pcpu_alloc() are choosen by OOM killer, they can't exit, because they are waiting for the mutex. The patch makes pcpu_alloc() to care about killing signal and use mutex_lock_killable(), when it's allowed by GFP flags. This guarantees, a task does not miss SIGKILL from OOM killer. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2018-03-19percpu: include linux/sched.h for cond_resched()Tejun Heo1-0/+1
microblaze build broke due to missing declaration of the cond_resched() invocation added recently. Let's include linux/sched.h explicitly. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <fengguang.wu@intel.com>
2018-02-23percpu: add a schedule point in pcpu_balance_workfn()Eric Dumazet1-0/+1
When a large BPF percpu map is destroyed, I have seen pcpu_balance_workfn() holding cpu for hundreds of milliseconds. On KASAN config and 112 hyperthreads, average time to destroy a chunk is ~4 ms. [ 2489.841376] destroy chunk 1 in 4148689 ns ... [ 2490.093428] destroy chunk 32 in 4072718 ns Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2018-02-18percpu: allow select gfp to be passed to underlying allocatorsDennis Zhou1-9/+7
The prior patch added support for passing gfp flags through to the underlying allocators. This patch allows users to pass along gfp flags (currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying allocators. This should allow users to decide if they are ok with failing allocations recovering in a more graceful way. Additionally, gfp passing was done as additional flags in the previous patch. Instead, change this to caller passed semantics. GFP_KERNEL is also removed as the default flag. It continues to be used for internally caused underlying percpu allocations. V2: Removed gfp_percpu_mask in favor of doing it inline. Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2018-02-18percpu: add __GFP_NORETRY semantics to the percpu balancing pathDennis Zhou1-17/+27
Percpu memory using the vmalloc area based chunk allocator lazily populates chunks by first requesting the full virtual address space required for the chunk and subsequently adding pages as allocations come through. To ensure atomic allocations can succeed, a workqueue item is used to maintain a minimum number of empty pages. In certain scenarios, such as reported in [1], it is possible that physical memory becomes quite scarce which can result in either a rather long time spent trying to find free pages or worse, a kernel panic. This patch adds support for __GFP_NORETRY and __GFP_NOWARN passing them through to the underlying allocators. This should prevent any unnecessary panics potentially caused by the workqueue item. The passing of gfp around is as additional flags rather than a full set of flags. The next patch will change these to caller passed semantics. V2: Added const modifier to gfp flags in the balance path. Removed an extra whitespace. [1] https://lkml.org/lkml/2018/2/12/551 Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2018-02-18percpu: match chunk allocator declarations with definitionsDennis Zhou1-2/+4
At some point the function declaration parameters got out of sync with the function definitions in percpu-vm.c and percpu-km.c. This patch makes them match again. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-27percpu: hack to let the CRIS architecture to boot until they clean upNicolas Pitre1-0/+4
Commit 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info") uncovered a problem on the CRIS architecture where the bootmem allocator is initialized with virtual addresses. Given it has: #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) then things just work out because the end result is the same whether you give this a physical or a virtual address. Untill you call memblock_free_early(__pa(address)) that is, because values from __pa() don't match with the virtual addresses stuffed in the bootmem allocator anymore. Avoid freeing the temporary pcpu_alloc_info memory on that architecture until they fix things up to let the kernel boot like it did before. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info")
2017-11-15Merge branch 'for-4.15' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu update from Tejun Heo: "Another minor pull request. It only contains one commit which can reclaim a bit of memory wasted during boot on UP" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: don't forget to free the temporary struct pcpu_alloc_info
2017-10-19mm, percpu: add support for __GFP_NOWARN flagDaniel Borkmann1-5/+10
Add an option for pcpu_alloc() to support __GFP_NOWARN flag. Currently, we always throw a warning when size or alignment is unsupported (and also dump stack on failed allocation requests). The warning itself is harmless since we return NULL anyway for any failed request, which callers are required to handle anyway. However, it becomes harmful when panic_on_warn is set. The rationale for the WARN() in pcpu_alloc() is that it can be tracked when larger than supported allocation requests are made such that allocations limits can be tweaked if warranted. This makes sense for in-kernel users, however, there are users of pcpu allocator where allocation size is derived from user space requests, e.g. when creating BPF maps. In these cases, the requests should fail gracefully without throwing a splat. The current work-around was to check allocation size against the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for bailing out prior to a call to pcpu_alloc() in order to avoid throwing the WARN(). This is bad in multiple ways since PCPU_MIN_UNIT_SIZE is an implementation detail, and having the checks on call-sites only complicates the code for no good reason. Thus, lets fix it generically by supporting the __GFP_NOWARN flag that users can then use with calling the __alloc_percpu_gfp() helper instead. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04percpu: don't forget to free the temporary struct pcpu_alloc_infoNicolas Pitre1-1/+2
Unlike the SMP case, the !SMP case does not free the memory for struct pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a chance of being reused by the page allocator later, align it to a page boundary just like its size. Signed-off-by: Nicolas Pitre <nico@linaro.org> Acked-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-09-28percpu: fix iteration to prevent skipping over blockDennis Zhou1-0/+4
The iterator functions pcpu_next_md_free_region and pcpu_next_fit_region use the block offset to determine if they have checked the area in the prior iteration. However, this causes an issue when the block offset is greater than subsequent block contig hints. If within the iterator it moves to check subsequent blocks, it may fail in the second predicate due to the block offset not being cleared. Thus, this causes the allocator to skip over blocks leading to false failures when allocating from the reserved chunk. While this happens in the general case as well, it will only fail if it cannot allocate a new chunk. This patch resets the block offset to 0 to pass the second predicate when checking subseqent blocks within the iterator function. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reported-and-tested-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: update header to contain bitmap allocator explanation.Dennis Zhou (Facebook)1-14/+18
The other patches contain a lot of information, so adding this information in a separate patch. It adds my copyright and a brief explanation of how the bitmap allocator works. There is a minor typo as well in the prior explanation so that is fixed. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: update pcpu_find_block_fit to use an iteratorDennis Zhou (Facebook)1-20/+92
The simple, and expensive, way to find a free area is to iterate over the entire bitmap until an area is found that fits the allocation size and alignment. This patch makes use of an iterate that find an area to check by using the block level contig hints. It will only return an area that can fit the size and alignment request. If the request can fit inside a block, it returns the first_free bit to start checking from to see if it can be fulfilled prior to the contig hint. The pcpu_alloc_area check has a bound of a block size added in case it is wrong. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: use metadata blocks to update the chunk contig hintDennis Zhou (Facebook)1-10/+70
The largest free region will either be a block level contig hint or an aggregate over the left_free and right_free areas of blocks. This is a much smaller set of free areas that need to be checked than a full traverse. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: update free path to take advantage of contig hintsDennis Zhou (Facebook)1-3/+65
The bitmap allocator must keep metadata consistent. The easiest way is to scan after every allocation for each affected block and the entire chunk. This is rather expensive. The free path can take advantage of current contig hints to prevent scanning within the start and end block. If a scan is needed, it can be done by scanning backwards from the start and forwards from the end to identify the entire free area this can be combined with. The blocks can then be updated by some basic checks rather than complete block scans. A chunk scan happens when the freed area makes a page free, a block free, or spans across blocks. This is necessary as the contig hint at this point could span across blocks. The check uses the minimum of page size and the block size to allow for variable sized blocks. There is a tradeoff here with not updating after every free. It is possible a contig hint in one block can be merged with the contig hint in the next block. This means the contig hint can be off by up to a page. However, if the chunk's contig hint is contained in one block, the contig hint will be accurate. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: update alloc path to only scan if contig hints are brokenDennis Zhou (Facebook)1-3/+56
Metadata is kept per block to keep track of where the contig hints are. Scanning can be avoided when the contig hints are not broken. In that case, left and right contigs have to be managed manually. This patch changes the allocation path hint updating to only scan when contig hints are broken. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: keep track of the best offset for contig hintsDennis Zhou (Facebook)1-1/+12
This patch makes the contig hint starting offset optimization from the previous patch as honest as it can be. For both chunk and block starting offsets, make sure it keeps the starting offset with the best alignment. The block skip optimization is added in a later patch when the pcpu_find_block_fit iterator is swapped in. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: skip chunks if the alloc does not fit in the contig hintDennis Zhou (Facebook)1-2/+16
This patch adds chunk->contig_bits_start to keep track of the contig hint's offset and the check to skip the chunk if it does not fit. If the chunk's contig hint starting offset cannot satisfy an allocation, the allocator assumes there is enough memory pressure in this chunk to either use a different chunk or create a new one. This accepts a less tight packing for a smoother latency curve. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: add first_bit to keep track of the first free in the bitmapDennis Zhou (Facebook)1-2/+15
This patch adds first_bit to keep track of the first free bit in the bitmap. This hint helps prevent scanning of fully allocated blocks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce bitmap metadata blocksDennis Zhou (Facebook)1-12/+216
This patch introduces the bitmap metadata blocks and adds the skeleton of the code that will be used to maintain these blocks. Each chunk's bitmap is made up of full metadata blocks. These blocks maintain basic metadata to help prevent scanning unnecssarily to update hints. Full scanning methods are used for the skeleton and will be replaced in the coming patches. A number of helper functions are added as well to do conversion of pages to blocks and manage offsets. Comments will be updated as the final version of each function is added. There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE, the region size, and unit_size. Every chunk's region (including offsets) is page aligned at the beginning to preserve alignment. The end is aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end can fit with the populated page map which is by page and every metadata block is fully accounted for. The unit_size is already page aligned, but must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata blocks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: replace area map allocator with bitmapDennis Zhou (Facebook)1-456/+273
The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: generalize bitmap (un)populated iteratorsDennis Zhou (Facebook)1-24/+25
The area map allocator only used a bitmap for the backing page state. The new bitmap allocator will use bitmaps to manage the allocation region in addition to this. This patch generalizes the bitmap iterators so they can be reused with the bitmap allocator. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: increase minimum percpu allocation size and align first regionsDennis Zhou (Facebook)1-7/+20
This patch increases the minimum allocation size of percpu memory to 4-bytes. This change will help minimize the metadata overhead associated with the bitmap allocator. The assumption is that most allocations will be of objects or structs greater than 2 bytes with integers or longs being used rather than shorts. The first chunk regions are now aligned with the minimum allocation size. The reserved region is expected to be set as a multiple of the minimum allocation size. The static region is aligned up and the delta is removed from the dynamic size. This works because the dynamic size is increased to be page aligned. If the static size is not minimum allocation size aligned, then there must be a gap that is added to the dynamic size. The dynamic size will never be smaller than the set value. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce nr_empty_pop_pages to help empty page accountingDennis Zhou (Facebook)1-3/+8
pcpu_nr_empty_pop_pages is used to ensure there are a handful of free pages around to serve atomic allocations. A new field, nr_empty_pop_pages, is added to the pcpu_chunk struct to keep track of the number of empty pages. This field is needed as the number of empty populated pages is globally tracked and deltas are used to update in the bitmap allocator. Pages that contain a hidden area are not considered to be empty. This new field is exposed in percpu_stats. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: change the number of pages marked in the first_chunk pop bitmapDennis Zhou (Facebook)1-7/+9
The populated bitmap represents the state of the pages the chunk serves. Prior, the bitmap was marked completely used as the first chunk was allocated and immutable. This is misleading because the first chunk may not be completely filled. Additionally, with moving the base_addr up in the previous patch, the population check no longer corresponds to what was being checked. This patch modifies the population map to be only the number of pages the region serves and to make what it was checking correspond correctly again. The change is to remove any misunderstanding between the size of the populated bitmap and the actual size of it. The work function page iterators now use nr_pages for the check rather than pcpu_unit_pages because nr_populated is now chunk specific. Without this, the work function would try to populate the remainder of these chunks despite it not serving any more than nr_pages when nr_pages is set less than pcpu_unit_pages. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: combine percpu address checksDennis Zhou (Facebook)1-40/+11
The percpu address checks for the reserved and dynamic region chunks are now specific to each region. The address checking logic can be combined taking advantage of the global references to the dynamic and static region chunks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: modify base_addr to be region specificDennis Zhou (Facebook)1-41/+114
Originally, the first chunk was served by one or two chunks, each given a region they are responsible for. Despite this, the arithmetic was based off of the true base_addr of the chunk making it be overly inclusive. This patch moves the base_addr of chunks that are responsible for the first chunk. The base_addr must remain page aligned to keep the address alignment correct, so it is the beginning of the region served page aligned down. start_offset holds where the region served begins from this new base_addr. The corresponding percpu address checks are modified to be more specific as a result. The first chunk considers only the dynamic region and both first chunk and reserved chunk checks ignore the static region. The static region addresses should never be passed into the allocator. There is no impact here besides distinguishing the first chunk and making the checks specific. The percpu pointer to physical address is left intact as addresses are not given out in the non-allocated portion of percpu memory. nr_pages is added to pcpu_chunk to keep track of the size of the entire region served containing both start_offset and end_offset. This variable will be used to manage the bitmap allocator. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk rename schunk/dchunk to chunkDennis Zhou (Facebook)1-8/+8
There is no need to have the static chunk and dynamic chunk be named separately as the allocations are sequential. This preemptively solves the misnomer problem with the base_addrs being moved up in the following patch. It also removes a ternary operation deciding the first chunk. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: end chunk area maps page aligned for the populated bitmapDennis Zhou (Facebook)1-0/+9
The area map allocator manages the first chunk area by hiding all but the region it is responsible for serving in the area map. To align this with the populated page bitmap, end_offset is introduced to keep track of the delta to end page aligned. The area map is appended with the page aligned end when necessary to be in line with how the bitmap allocator requires the ending to be aligned with the LCM of PAGE_SIZE and the size of each bitmap block. percpu_stats is updated to ignore this region when present. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: unify allocation of schunk and dchunkDennis Zhou (Facebook)1-33/+40
Create a common allocator for first chunk initialization, pcpu_alloc_first_chunk. Comments for this function will be added in a later patch once the bitmap allocator is added. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk remove dyn_size and consolidate logicDennis Zhou (Facebook)1-12/+6
There is logic for setting variables in the static chunk init code that could be consolidated with the dynamic chunk init code. This combines this logic to setup for combining the allocation paths. reserved_size is used as the conditional as a dynamic region will always exist. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: remove has_reserved from pcpu_chunkDennis Zhou (Facebook)1-3/+0
Prior this variable was used to manage statistics when the first chunk had a reserved region. The previous patch introduced start_offset to keep track of the offset by value rather than boolean. Therefore, has_reserved can be removed. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce start_offset to pcpu_chunkDennis Zhou (Facebook)1-11/+10
The reserved chunk arithmetic uses a global variable pcpu_reserved_chunk_limit that is set in the first chunk init code to hide a portion of the area map. The bitmap allocator to come will eventually move the base_addr up and require both the reserved chunk and static chunk to maintain this offset. pcpu_reserved_chunk_limit is removed and start_offset is added. The first chunk that is circulated and is pcpu_first_chunk serves the dynamic region, the region following the reserved region. The reserved chunk address check will temporarily use the first chunk to identify its address range. A following patch will increase the base_addr and remove this. If there is no reserved chunk, this will check the static region and return false because those values should never be passed into the allocator. Lastly, when linking in the first chunk, make sure to count the right free region for the number of empty populated pages. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk enforce dynamic region must existDennis Zhou (Facebook)1-5/+4
The first chunk is handled as a special case as it is composed of the static, reserved, and dynamic regions. The code handles each case individually. The next several patches will merge these code paths and lay the foundation for the bitmap allocator. This patch modifies logic to enforce that a dynamic region exists and changes the area map to account for that. This brings the logic closer to the dynamic chunk's init logic. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17percpu: update the header comment and pcpu_build_alloc_info commentsDennis Zhou (Facebook)1-26/+32
The header comment for percpu memory is a little hard to parse and is not super clear about how the first chunk is managed. This adds a little more clarity to the situation. There is also quite a bit of tricky logic in the pcpu_build_alloc_info. This adds a restructure of a comment to add a little more information. Unfortunately, you will still have to piece together a handful of other comments too, but should help direct you to the meaningful comments. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17percpu: expose pcpu_nr_empty_pop_pages in pcpu_statsDennis Zhou (Facebook)1-1/+1
Percpu memory holds a minimum threshold of pages that are populated in order to serve atomic percpu memory requests. This change makes it easier to verify that there are a minimum number of populated pages lying around. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-21percpu: resolve err may not be initialized in pcpu_allocDennis Zhou1-1/+3
From 4a42ecc735cff0015cc73c3d87edede631f4b885 Mon Sep 17 00:00:00 2001 From: Dennis Zhou <dennisz@fb.com> Date: Wed, 21 Jun 2017 08:07:15 -0700 Add error message to out of space failure for atomic allocations in percpu allocation path to fix -Wmaybe-uninitialized. Signed-off-by: Dennis Zhou <dennisz@fb.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20percpu: add tracepoint support for percpu memoryDennis Zhou1-0/+12
Add support for tracepoints to the following events: chunk allocation, chunk free, area allocation, area free, and area allocation failure. This should let us replay percpu memory requests and evaluate corresponding decisions. Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20percpu: expose statistics about percpu memory via debugfsDennis Zhou1-0/+9
There is limited visibility into the use of percpu memory leaving us unable to reason about correctness of parameters and overall use of percpu memory. These counters and statistics aim to help understand basic statistics about percpu memory such as number of allocations over the lifetime, allocation sizes, and fragmentation. New Config: PERCPU_STATS Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20percpu: migrate percpu data structures to internal headerDennis Zhou1-23/+7
Migrates pcpu_chunk definition and a few percpu static variables to an internal header file from mm/percpu.c. These will be used with debugfs to expose statistics about percpu memory improving visibility regarding allocations and fragmentation. Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-20percpu: add missing lockdep_assert_held to func pcpu_free_areaDennis Zhou1-0/+2
Add a missing lockdep_assert_held for pcpu_lock to improve consistency and safety throughout mm/percpu.c. Signed-off-by: Dennis Zhou <dennisz@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-10mark most percpu globals as __ro_after_initDaniel Micay1-18/+18
Moving pcpu_base_addr to this section comes from PaX where it's part of KERNEXEC. This extends it to the rest of the globals only written by the init code. Signed-off-by: Daniel Micay <danielmicay@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-04Merge branch 'sched/core' into locking/coreThomas Gleixner1-1/+4
Required for the rtmutex/sched_deadline patches which depend on both branches
2017-03-26lockdep: Fix per-cpu static objectsPeter Zijlstra1-1/+4
Since commit 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") we try to collapse per-cpu locks into a single class by giving them all the same key. For this key we choose the canonical address of the per-cpu object, which would be the offset into the per-cpu area. This has two problems: - there is a case where we run !0 lock->key through static_obj() and expect this to pass; it doesn't for canonical pointers. - 0 is a valid canonical address. Cure both issues by redefining the canonical address as the address of the per-cpu variable on the boot CPU. Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at all, track the boot CPU in a variable. Fixes: 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Borislav Petkov <bp@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-mm@kvack.org Cc: wfg@linux.intel.com Cc: kernel test robot <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-16locking/lockdep: Handle statically initialized PER_CPU locks properlyThomas Gleixner1-14/+23
If a PER_CPU struct which contains a spin_lock is statically initialized via: DEFINE_PER_CPU(struct foo, bla) = { .lock = __SPIN_LOCK_UNLOCKED(bla.lock) }; then lockdep assigns a seperate key to each lock because the logic for assigning a key to statically initialized locks is to use the address as the key. With per CPU locks the address is obvioulsy different on each CPU. That's wrong, because all locks should have the same key. To solve this the following modifications are required: 1) Extend the is_kernel/module_percpu_addr() functions to hand back the canonical address of the per CPU address, i.e. the per CPU address minus the per CPU offset. 2) Check the lock address with these functions and if the per CPU check matches use the returned canonical address as the lock key, so all per CPU locks have the same key. 3) Move the static_obj(key) check into look_up_lock_class() so this check can be avoided for statically initialized per CPU locks. That's required because the canonical address fails the static_obj(key) check for obvious reasons. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Merged Dan's fixups for !MODULES and !SMP into this patch. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Murphy <dmurphy@ti.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170227143736.pectaimkjkan5kow@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-06percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pagesTahsin Erdogan1-1/+4
Update to pcpu_nr_empty_pop_pages in pcpu_alloc() is currently done without holding pcpu_lock. This can lead to bad updates to the variable. Add missing lock calls. Fixes: b539b87fed37 ("percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v3.18+
2017-02-27scripts/spelling.txt: add "followings" pattern and fix typo instancesMasahiro Yamada1-1/+1
Fix typos and add the following to the scripts/spelling.txt: followings||following While we are here, add a missing colon in the boilerplate in DT binding documents. The "you SoC" in allwinner,sunxi-pinctrl.txt was fixed as well. I reworded "as the followings:" to "as follows:" for drivers/usb/gadget/udc/renesas_usb3.c. Link: http://lkml.kernel.org/r/1481573103-11329-32-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13Merge branch 'for-4.10' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu update from Tejun Heo: "This includes just one patch to reject non-power-of-2 alignments and trigger warning. Interestingly, this actually caught a bug in XEN ARM64" * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: ensure the requested alignment is power of two
2016-12-12mm/percpu.c: fix panic triggered by BUG_ON() falselyzijun_hu1-4/+12
As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to @upa boundary/ Therefore, the number of CPUs isn't equal to the units's if it isn't aligned to @upa normally. However, pcpu_page_first_chunk() uses BUG_ON() to assert that one number is equal to the other roughly, so a panic is maybe triggered by the BUG_ON() incorrectly. In order to fix this issue, the number of CPUs is rounded up then compared with units's and the BUG_ON() is replaced with a warning and return of an error code as well, to keep system alive as much as possible. Link: http://lkml.kernel.org/r/57FCF07C.2020103@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19percpu: ensure the requested alignment is power of twozijun_hu1-1/+2
The percpu allocator expectedly assumes that the requested alignment is power of two but hasn't been veryfing the input. If the specified alignment isn't power of two, the allocator can malfunction. Add the sanity check. The following is detailed analysis of the effects of alignments which aren't power of two. The alignment must be a even at least since the LSB of a chunk->map element is used as free/in-use flag of a area; besides, the alignment must be a power of 2 too since ALIGN() doesn't work well for other alignment always but is adopted by pcpu_fit_in_area(). IOW, the current allocator only works well for a power of 2 aligned area allocation. See below opposite example for why an odd alignment doesn't work. Let's assume area [16, 36) is free but its previous one is in-use, we want to allocate a @size == 8 and @align == 7 area. The larger area [16, 36) is split to three areas [16, 21), [21, 29), [29, 36) eventually. However, due to the usage for a chunk->map element, the actual offset of the aim area [21, 29) is 21 but is recorded in relevant element as 20; moreover, the residual tail free area [29, 36) is mistook as in-use and is lost silently Unlike macro roundup(), ALIGN(x, a) doesn't work if @a isn't a power of 2 for example, roundup(10, 6) == 12 but ALIGN(10, 6) == 10, and the latter result isn't desired obviously. tj: Code style and patch description updates. Signed-off-by: zijun_hu <zijun_hu@htc.com> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()zijun_hu1-18/+18
in order to ensure the percpu group areas within a chunk aren't distributed too sparsely, pcpu_embed_first_chunk() goes to error handling path when a chunk spans over 3/4 VMALLOC area, however, during the error handling, it forget to free the memory allocated for all percpu groups by going to label @out_free other than @out_free_areas. it will cause memory leakage issue if the rare scene really happens, in order to fix the issue, we check chunk spanned area immediately after completing memory allocation for all percpu groups, we go to label @out_free_areas to free the memory then return if the checking is failed. in order to verify the approach, we dump all memory allocated then enforce the jump then dump all memory freed, the result is okay after checking whether we free all memory we allocate in this function. BTW, The approach is chosen after thinking over the below scenes - we don't go to label @out_free directly to fix this issue since we maybe free several allocated memory blocks twice - the aim of jumping after pcpu_setup_first_chunk() is bypassing free usable memory other than handling error, moreover, the function does not return error code in any case, it either panics due to BUG_ON() or return 0. Signed-off-by: zijun_hu <zijun_hu@htc.com> Tested-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()zijun_hu1-6/+8
pcpu_embed_first_chunk() calculates the range a percpu chunk spans into @max_distance and uses it to ensure that a chunk is not too big compared to the total vmalloc area. However, during calculation, it used incorrect top address by adding a unit size to the highest group's base address. This can make the calculated max_distance slightly smaller than the actual distance although given the scale of values involved the error is very unlikely to have an actual impact. Fix this issue by adding the group's size instead of a unit size. BTW, The type of variable max_distance is changed from size_t to unsigned long too based on below consideration: - type unsigned long usually have same width with IP core registers and can be applied at here very well - make @max_distance type consistent with the operand calculated against it such as @ai->groups[i].base_offset and macro VMALLOC_TOTAL - type unsigned long is more universal then size_t, size_t is type defined to unsigned int or unsigned long among various ARCHs usually Signed-off-by: zijun_hu <zijun_hu@htc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-25percpu: fix synchronization between synchronous map extension and chunk ↵Tejun Heo1-8/+8
destruction For non-atomic allocations, pcpu_alloc() can try to extend the area map synchronously after dropping pcpu_lock; however, the extension wasn't synchronized against chunk destruction and the chunk might get freed while extension is in progress. This patch fixes the bug by putting most of non-atomic allocations under pcpu_alloc_mutex to synchronize against pcpu_balance_work which is responsible for async chunk management including destruction. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 1a4d76076cda ("percpu: implement asynchronous chunk population")
2016-05-25percpu: fix synchronization between chunk->map_extend_work and chunk destructionTejun Heo1-21/+36
Atomic allocations can trigger async map extensions which is serviced by chunk->map_extend_work. pcpu_balance_work which is responsible for destroying idle chunks wasn't synchronizing properly against chunk->map_extend_work and may end up freeing the chunk while the work item is still in flight. This patch fixes the bug by rolling async map extension operations into pcpu_balance_work. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: stable@vger.kernel.org # v3.18+ Fixes: 9c824b6a172c ("percpu: make sure chunk->map array has available space")
2016-03-17mm: percpu: use pr_fmt to prefix outputJoe Perches1-9/+11
Use the normal mechanism to make the logging output consistently "percpu:" instead of a mix of "PERCPU:" and "percpu:" Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: convert printk(KERN_<LEVEL> to pr_<level>Joe Perches1-6/+6
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: coalesce split stringsJoe Perches1-2/+2
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17mm: convert pr_warning to pr_warnJoe Perches1-8/+7
There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn consistently. Miscellanea: - Coalesce formats - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22tree wide: use kvfree() than conditional kfree()/vfree()Tetsuo Handa1-11/+7
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/percpu: use offset_in_page macroAlexander Kuleshov1-5/+5
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-21percpu: clean up of schunk->map[] assignment in pcpu_setup_first_chunkBaoquan He1-3/+2
The original assignment is a little redundent. Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-06-24mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()Larry Finger1-1/+1
Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the following INFO splat is logged: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc7-next-20150612 #1 Not tainted ------------------------------- kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by systemd/1: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40 #1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540 #2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540 stack backtrace: CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 Call Trace: dump_stack+0x4c/0x6e lockdep_rcu_suspicious+0xe7/0x120 ___might_sleep+0x1d5/0x1f0 __might_sleep+0x4d/0x90 kmem_cache_alloc+0x47/0x250 create_object+0x39/0x2e0 kmemleak_alloc_percpu+0x61/0xe0 pcpu_alloc+0x370/0x630 Additional backtrace lines are truncated. In addition, the above splat is followed by several "BUG: sleeping function called from invalid context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau, these are the clue to the fix. Routine kmemleak_alloc_percpu() always uses GFP_KERNEL for its allocations, whereas it should follow the gfp from its callers. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-24percpu: Fix trivial typos in commentsYannick Guerrini1-2/+2
Change 'tranlated' to 'translated' Change 'mutliples' to 'multiples' Signed-off-by: Yannick Guerrini <yguerrini@tomshardware.fr> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-02-13percpu: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo1-4/+2
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29percpu: off by one in BUG_ON()Dan Carpenter1-1/+1
The unit_map[] array has "nr_cpu_ids" number of elements. It's allocated a few lines earlier in the function. So this test should be >= instead of >. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-08percpu: fix how @gfp is interpreted by the percpu allocatorTejun Heo1-1/+1
When @gfp is specified, the percpu allocator is interested in whether it contains all of GFP_KERNEL or not. If it does, the normal allocation path is taken; otherwise, the atomic allocation path. Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp contains any part of GFP_KERNEL. Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of "!(gfp & GFP_KERNEL)" to decide whether the allocation should be atomic or not. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-21Revert "percpu: free percpu allocation info for uniprocessor system"Guenter Roeck1-2/+0
This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system"). The commit causes a hang with a crisv32 image. This may be an architecture problem, but at least for now the revert is necessary to be able to boot a crisv32 image. Cc: Tejun Heo <tj@kernel.org> Cc: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system") Cc: stable@vger.kernel.org # Please don't apply 3189eddbcafc
2014-09-09percpu: fix locking regression in the failure path of pcpu_alloc()Tejun Heo1-0/+1
While updating locking, b38d08f3181c ("percpu: restructure locking") broke pcpu_create_chunk() creation path in pcpu_alloc(). It returns without releasing pcpu_alloc_mutex. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Julia Lawall <julia.lawall@lip6.fr>
2014-09-02percpu: implement asynchronous chunk populationTejun Heo1-4/+113
The percpu allocator now supports atomic allocations by only allocating from already populated areas but the mechanism to ensure that there's adequate amount of populated areas was missing. This patch expands pcpu_balance_work so that in addition to freeing excess free chunks it also populates chunks to maintain an adequate level of populated areas. pcpu_alloc() schedules pcpu_balance_work if the amount of free populated areas is too low or after an atomic allocation failure. * PERPCU_DYNAMIC_RESERVE is increased by two pages to account for PCPU_EMPTY_POP_PAGES_LOW. * pcpu_async_enabled is added to gate both async jobs - chunk->map_extend_work and pcpu_balance_work - so that we don't end up scheduling them while the needed subsystems aren't up yet. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: rename pcpu_reclaim_work to pcpu_balance_workTejun Heo1-15/+12
pcpu_reclaim_work will also be used to populate chunks asynchronously. Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is renamed to pcpu_balance_workfn() and some of its local variables are renamed too. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populatedTejun Heo1-9/+113
pcpu_nr_empty_pop_pages counts the number of empty populated pages across all chunks and chunk->nr_populated counts the number of populated pages in a chunk. Both will be used to implement pre/async population for atomic allocations. pcpu_chunk_[de]populated() are added to update chunk->populated, chunk->nr_populated and pcpu_nr_empty_pop_pages together. All successful chunk [de]populations should be followed by the corresponding pcpu_chunk_[de]populated() calls. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: make sure chunk->map array has available spaceTejun Heo1-8/+45
An allocation attempt may require extending chunk->map array which requires GFP_KERNEL context which isn't available for atomic allocations. This patch ensures that chunk->map array usually keeps some amount of available space by directly allocating buffer space during GFP_KERNEL allocations and scheduling async extension during atomic ones. This should make atomic allocation failures from map space exhaustion rare. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: implement [__]alloc_percpu_gfp()Tejun Heo1-25/+39
Now that pcpu_alloc_area() can allocate only from populated areas, it's easy to add atomic allocation support to [__]alloc_percpu(). Update pcpu_alloc() so that it accepts @gfp and skips all the blocking operations and allocates only from the populated areas if @gfp doesn't contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp() are added. While this means that atomic allocations are possible, this isn't complete yet as there's no mechanism to ensure that certain amount of populated areas is kept available and atomic allocations may keep failing under certain conditions. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: indent the population block in pcpu_alloc()Tejun Heo1-17/+21
The next patch will conditionalize the population block in pcpu_alloc() which will end up making a rather large indentation change obfuscating the actual logic change. This patch puts the block under "if (true)" so that the next patch can avoid indentation changes. The defintions of the local variables which are used only in the block are moved into the block. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: make pcpu_alloc_area() capable of allocating only from populated areasTejun Heo1-7/+58
Update pcpu_alloc_area() so that it can skip unpopulated areas if the new parameter @pop_only is true. This is implemented by a new function, pcpu_fit_in_area(), which determines the amount of head padding considering the alignment and populated state. @pop_only is currently always false but this will be used to implement atomic allocation. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: restructure lockingTejun Heo1-40/+35
At first, the percpu allocator required a sleepable context for both alloc and free paths and used pcpu_alloc_mutex to protect everything. Later, pcpu_lock was introduced to protect the index data structure so that the free path can be invoked from atomic contexts. The conversion only updated what's necessary and left most of the allocation path under pcpu_alloc_mutex. The percpu allocator is planned to add support for atomic allocation and this patch restructures locking so that the coverage of pcpu_alloc_mutex is further reduced. * pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new chunk and populating the allocated area. Everything else is now protected soley by pcpu_lock. After this change, multiple instances of pcpu_extend_area_map() may race but the function already implements sufficient synchronization using pcpu_lock. This also allows multiple allocators to arrive at new chunk creation. To avoid creating multiple empty chunks back-to-back, a new chunk is created iff there is no other empty chunk after grabbing pcpu_alloc_mutex. * pcpu_lock is now held while modifying chunk->populated bitmap. After this, all data structures are protected by pcpu_lock. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: move region iterations out of pcpu_[de]populate_chunk()Tejun Heo1-11/+8
Previously, pcpu_[de]populate_chunk() were called with the range which may contain multiple target regions in it and pcpu_[de]populate_chunk() iterated over the regions. This has the benefit of batching up cache flushes for all the regions; however, we're planning to add more bookkeeping logic around [de]population to support atomic allocations and this delegation of iterations gets in the way. This patch moves the region iterations out of pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and pcpu_reclaim() - so that we can later add logic to track more states around them. This change may make cache and tlb flushes more frequent but multi-region [de]populations are rare anyway and if this actually becomes a problem, it's not difficult to factor out cache flushes as separate callbacks which are directly invoked from percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: move common parts out of pcpu_[de]populate_chunk()Tejun Heo1-9/+30
percpu-vm and percpu-km implement separate versions of pcpu_[de]populate_chunk() and some part which is or should be common are currently in the specific implementations. Make the following changes. * Allocate area clearing is moved from the pcpu_populate_chunk() implementations to pcpu_alloc(). This makes percpu-km's version noop. * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved to their respective callers so that they are applied to percpu-km too. This doesn't make any meaningful difference as both functions are noop for percpu-km; however, this is more consistent and will help implementing atomic allocation support. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-16percpu: free percpu allocation info for uniprocessor systemHonggang Li1-0/+2
Currently, only SMP system free the percpu allocation info. Uniprocessor system should free it too. For example, one x86 UML virtual machine with 256MB memory, UML kernel wastes one page memory. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-06-19percpu: Use ALIGN macro instead of hand coding alignment calculationChristoph Lameter1-2/+1
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-14percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()Jianyu Zhan1-1/+1
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long) It hardly could be ever bigger than PAGE_SIZE even for large-scale machine, but for consistency with its couterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online) Cc: stable@vger.kernel.org
2014-03-29percpu: renew the max_contig if we merge the head and previous blockJianyu Zhan1-1/+3
During pcpu_alloc_area(), we might merge the current head with the previous block. Since we have calculated the max_contig using the size of previous block before we skip it, and now we update the size of previous block, so we should renew the max_contig. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-17percpu: allocation size should be evenViro1-1/+4
723ad1d90b56 ("percpu: store offsets instead of lengths in ->map[]") updated percpu area allocator to use the lowest bit, instead of sign, to signify whether the area is occupied and forced min align to 2; unfortunately, it forgot to force the allocation size to be even causing malfunctions for the very rare odd-sized allocations. Always force the allocations to be even sized. tj: Wrote patch description. Original-patch-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07percpu: speed alloc_pcpu_area() upAl Viro1-1/+17
If we know that first N areas are all in use, we can obviously skip them when searching for a free one. And that kind of hint is very easy to maintain. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07percpu: store offsets instead of lengths in ->map[]Al Viro1-55/+81
Current code keeps +-length for each area in chunk->map[]. It has several unpleasant consequences: * even if we know that first 50 areas are all in use, allocation still needs to go through all those areas just to sum their sizes, just to get the offset of free one. * freeing needs to find the array entry refering to the area in question; again, the need to sum the sizes until we reach the offset we are interested in. Note that offsets are monotonous, so simple binary search would do here. New data representation: array of <offset,in-use flag> pairs. Each pair is represented by one int - we use offset|1 for <offset, in use> and offset for <offset, free> (we make sure that all offsets are even). In the end we put a sentry entry - <total size, in use>. The first entry is <0, flag>; it would be possible to store together the flag for Nth area and offset for N+1st, but that leads to much hairier code. In other words, where the old variant would have 4, -8, -4, 4, -12, 100 (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store <0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1> i.e. 0, 5, 13, 16, 21, 32, 133 This commit switches to new data representation and takes care of a couple of low-hanging fruits in free_pcpu_area() - one is the switch to binary search, another is not doing two memmove() when one would do. Speeding the alloc side up (by keeping track of how many areas in the beginning are known to be all in use) also becomes possible - that'll be done in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07perpcu: fold pcpu_split_block() into the only callerAl Viro1-47/+16
... and simplify the results a bit. Makes the next step easier to deal with - we will be changing the data representation for chunk->map[] and it's easier to do if the code in question is not split between pcpu_alloc_area() and pcpu_split_block(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-21Merge branch 'akpm' (incoming from Andrew)Linus Torvalds1-16/+22
Merge first patch-bomb from Andrew Morton: - a couple of misc things - inotify/fsnotify work from Jan - ocfs2 updates (partial) - about half of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) mm/migrate: remove unused function, fail_migrate_page() mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages mm/migrate: correct failure handling if !hugepage_migration_support() mm/migrate: add comment about permanent failure path mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure mm: compaction: reset scanner positions immediately when they meet mm: compaction: do not mark unmovable pageblocks as skipped in async compaction mm: compaction: detect when scanners meet in isolate_freepages mm: compaction: reset cached scanner pfn's before reading them mm: compaction: encapsulate defer reset logic mm: compaction: trace compaction begin and end memcg, oom: lock mem_cgroup_print_oom_info sched: add tracepoints related to NUMA task migration mm: numa: do not automatically migrate KSM pages mm: numa: trace tasks that fail migration due to rate limiting mm: numa: limit scope of lock for NUMA migrate rate limiting mm: numa: make NUMA-migrate related functions static lib/show_mem.c: show num_poisoned_pages when oom mm/hwpoison: add '#' to hwpoison_inject mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter ...
2014-01-21mm/percpu.c: use memblock apis for early memory allocationsSantosh Shilimkar1-16/+22
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Lindgren <tony@atomide.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21percpu: use VMALLOC_TOTAL instead of VMALLOC_END - VMALLOC_STARTLaura Abbott1-2/+2
vmalloc already gives a useful macro to calculate the total vmalloc size. Use it. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-23percpu: fix bootmem error handling in pcpu_page_first_chunk()Michael Holzheu1-2/+3
If memory allocation of in pcpu_embed_first_chunk() fails, the allocated memory is not released correctly. In the release loop also the non-allocated elements are released which leads to the following kernel BUG on systems with very little memory: [ 0.000000] kernel BUG at mm/bootmem.c:307! [ 0.000000] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 #22 [ 0.000000] task: 0000000000a20ae0 ti: 0000000000a08000 task.ti: 0000000000a08000 [ 0.000000] Krnl PSW : 0400000180000000 0000000000abda7a (__free+0x116/0x154) [ 0.000000] R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 EA:3 ... [ 0.000000] [<0000000000abdce2>] mark_bootmem_node+0xde/0xf0 [ 0.000000] [<0000000000abdd9c>] mark_bootmem+0xa8/0x118 [ 0.000000] [<0000000000abcbba>] pcpu_embed_first_chunk+0xe7a/0xf0c [ 0.000000] [<0000000000abcc96>] setup_per_cpu_areas+0x4a/0x28c To fix the problem now only allocated elements are released. This then leads to the correct kernel panic: [ 0.000000] Kernel panic - not syncing: Failed to initialize percpu areas. ... [ 0.000000] Call Trace: [ 0.000000] ([<000000000011307e>] show_trace+0x132/0x150) [ 0.000000] [<0000000000113160>] show_stack+0xc4/0xd4 [ 0.000000] [<00000000007127dc>] dump_stack+0x74/0xd8 [ 0.000000] [<00000000007123fe>] panic+0xea/0x264 [ 0.000000] [<0000000000b14814>] setup_per_cpu_areas+0x5c/0x28c tj: Flipped if conditional so that it doesn't need "continue". Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-02mm, percpu: Make sure percpu_alloc early parameter has an argumentCyrill Gorcunov1-0/+3
Otherwise we are getting a nil dereference if percpu_alloc kernel boot argument is specified without value. | [ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null) | [ 0.000000] IP: [<ffffffff81391360>] strcmp+0x10/0x30 Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-29percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()Joonsoo Kim1-1/+1
commit 099a19d9('allow limited allocation before slab is online') made pcpu_alloc_chunk() use pcpu_mem_zalloc() but forgot to update pcpu_free_chunk() accordingly. This doesn't cause any immediate problema, but fix it for consistency. tj: commit message updated Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-06sections: fix section conflicts in mm/percpu.cAndi Kleen1-1/+1
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-09kmemleak: Fix the kmemleak tracking of the percpu areas with !SMPCatalin Marinas1-0/+2
Kmemleak tracks the percpu allocations via a specific API and the originally allocated areas must be removed from kmemleak (via kmemleak_free). The code was already doing this for SMP systems. Reported-by: Sami Liedes <sami.liedes@iki.fi> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-09percpu: pcpu_embed_first_chunk() should free unused parts after all allocs ↵Tejun Heo1-0/+10
are complete pcpu_embed_first_chunk() allocates memory for each node, copies percpu data and frees unused portions of it before proceeding to the next group. This assumes that allocations for different nodes doesn't overlap; however, depending on memory topology, the bootmem allocator may end up allocating memory from a different node than the requested one which may overlap with the portion freed from one of the previous percpu areas. This leads to percpu groups for different nodes overlapping which is a serious bug. This patch separates out copy & partial free from the allocation loop such that all allocations are complete before partial frees happen. This also fixes overlapping frees which could happen on allocation failure path - out_free_areas path frees whole groups but the groups could have portions freed at that point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru> Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru> LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
2012-03-29percpu: use KERN_CONT in pcpu_dump_alloc_info()Tejun Heo1-5/+5
pcpu_dump_alloc_info() was printing continued lines without KERN_CONT. Use it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Kay Sievers <kay.sievers@vrfy.org>
2012-01-14Merge tag 'kmemleak' of ↵Linus Torvalds1-1/+11
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux Kmemleak patches Main features: - Handle percpu memory allocations (only scanning them, not actually reporting). - Memory hotplug support. Usability improvements: - Show the origin of early allocations. - Report previously found leaks even if kmemleak has been disabled by some error. * tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux: kmemleak: Add support for memory hotplug kmemleak: Handle percpu memory allocation kmemleak: Report previously found leaks even after an error kmemleak: When the early log buffer is exceeded, report the actual number kmemleak: Show where early_log issues come from