aboutsummaryrefslogtreecommitdiffstats
path: root/init/main.c
AgeCommit message (Collapse)AuthorFilesLines
2025-11-12init/main.c: wrap long kernel cmdline when printing to logsDouglas Anderson1-1/+96
The kernel cmdline length is allowed to be longer than what printk can handle. When this happens the cmdline that's printed to the kernel ring buffer at bootup is cutoff and some kernel cmdline options are "hidden" from the logs. This undercuts the usefulness of the log message. Specifically, grepping for COMMAND_LINE_SIZE shows that 2048 is common and some architectures even define it as 4096. s390 allows a CONFIG-based maximum up to 1MB (though it's not expected that anyone will go over the default max of 4096 [1]). The maximum message pr_notice() seems to be able to handle (based on experiment) is 1021 characters. This appears to be based on the current value of PRINTKRB_RECORD_MAX as 1024 and the fact that pr_notice() spends 2 characters on the loglevel prefix and we have a '\n' at the end. While it would be possible to increase the limits of printk() (and therefore pr_notice()) somewhat, it doesn't appear possible to increase it enough to fully include a 2048-character cmdline without breaking userspace. Specifically on at least two tested userspaces (ChromeOS plus the Debian-based distro I'm typing this message on) the `dmesg` tool reads lines from `/dev/kmsg` in 2047-byte chunks. As per `Documentation/ABI/testing/dev-kmsg`: Every read() from the opened device node receives one record of the kernel's printk buffer. ... Messages in the record ring buffer get overwritten as whole, there are never partial messages received by read(). We simply can't fit a 2048-byte cmdline plus the "Kernel command line:" prefix plus info about time/log_level/etc in a 2047-byte read. The above means that if we want to avoid the truncation we need to do some type of wrapping of the cmdline when printing. Add wrapping to the printout of the kernel command line. By default, the wrapping is set to 1021 characters to avoid breaking anyone, but allow wrapping to be set lower by a Kconfig knob "CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN". Any tools that are correctly parsing the cmdline today (because it is less than 1021 characters) will see no difference in their behavior. The format of wrapped output is designed to be matched by anyone using "grep" to search for the cmdline and also to be easy for tools to handle. Anyone who is sure their tools (if any) handle the wrapped format can choose a lower wrapping value and have prettier output. Setting CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN to 0 fully disables the wrapping logic. This means that long command lines will be truncated again, but this config could be set if command lines are expected to be long and userspace is known not to handle parsing logs with the wrapping. Wrapping is based on spaces, ignoring quotes. All lines are prefixed with "Kernel command line: " and lines that are not the last line have a " \" suffix added to them. The prefix and suffix count towards the line length for wrapping purposes. The ideal length will be exceeded if no appropriate place to wrap is found. The wrapping function added here is fairly generic and could be made a library function (somewhat like print_hex_dump()) if it's needed elsewhere in the kernel. However, having printk() directly incorporate this wrapping would be unlikely to be a good idea since it would break printouts into more than one record without any obvious common line prefix to tie lines together. It would also be extra overhead when, in general, kernel log message should simply be kept smaller than 1021 bytes. For some discussion on this topic, see responses to the v1 posting of this patch [2]. [akpm@linux-foundation.org: make print_kernel_cmdline __init] [dianders@chromium.org: v4] Link: https://lkml.kernel.org/r/20251027082204.v4.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid Link: https://lkml.kernel.org/r/20251023113257.v3.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid Link: https://lore.kernel.org/r/20251021131633.26700Dd6-hca@linux.ibm.com [1] Link: https://lore.kernel.org/r/CAD=FV=VNyt1zG_8pS64wgV8VkZWiWJymnZ-XCfkrfaAhhFSKcA@mail.gmail.com [2] Signed-off-by: Douglas Anderson <dianders@chromium.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Chant <achant@google.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Francesco Valla <francesco@valla.it> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Hendrik Farr <kernel@jfarr.cc> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-02Merge tag 'mm-nonmm-stable-2025-10-02-15-29' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet completes the removal of this legacy IDR API - "panic: introduce panic status function family" from Jinchao Wang provides a number of cleanups to the panic code and its various helpers, which were rather ad-hoc and scattered all over the place - "tools/delaytop: implement real-time keyboard interaction support" from Fan Yu adds a few nice user-facing usability changes to the delaytop monitoring tool - "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos Petrongonas fixes a panic which was happening with the combination of EFI and KHO - "Squashfs: performance improvement and a sanity check" from Phillip Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere 150x speedup was measured for a well-chosen microbenchmark - plus another 50-odd singleton patches all over the place * tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits) Squashfs: reject negative file sizes in squashfs_read_inode() kallsyms: use kmalloc_array() instead of kmalloc() MAINTAINERS: update Sibi Sankar's email address Squashfs: add SEEK_DATA/SEEK_HOLE support Squashfs: add additional inode sanity checking lib/genalloc: fix device leak in of_gen_pool_get() panic: remove CONFIG_PANIC_ON_OOPS_VALUE ocfs2: fix double free in user_cluster_connect() checkpatch: suppress strscpy warnings for userspace tools cramfs: fix incorrect physical page address calculation kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit Squashfs: fix uninit-value in squashfs_get_parent kho: only fill kimage if KHO is finalized ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name() kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock coccinelle: platform_no_drv_owner: handle also built-in drivers coccinelle: of_table: handle SPI device ID tables lib/decompress: use designated initializers for struct compress_format efi: support booting with kexec handover (KHO) ...
2025-09-29Merge tag 'namespace-6.18-rc1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_*() helpers net: port to ns_ref_*() helpers uts: port to ns_ref_*() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_*() helpers time: port to ns_ref_*() helpers pid: port to ns_ref_*() helpers ipc: port to ns_ref_*() helpers cgroup: port to ns_ref_*() helpers ...
2025-09-19time: support ns lookupChristian Brauner1-0/+2
Support the generic ns lookup infrastructure to support file handles for namespaces. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-13init: handle bootloader identifier in kernel parametersHuacai Chen1-0/+12
BootLoaders (Grub, LILO, etc) may pass an identifier such as "BOOT_IMAGE= /boot/vmlinuz-x.y.z" to kernel parameters. But these identifiers are not recognized by the kernel itself so will be passed to userspace. However user space init program also don't recognize it. KEXEC/KDUMP (kexec-tools) may also pass an identifier such as "kexec" on some architectures. We cannot change BootLoader's behavior, because this behavior exists for many years, and there are already user space programs search BOOT_IMAGE= in /proc/cmdline to obtain the kernel image locations: https://github.com/linuxdeepin/deepin-ab-recovery/blob/master/util.go (search getBootOptions) https://github.com/linuxdeepin/deepin-ab-recovery/blob/master/main.go (search getKernelReleaseWithBootOption) So the the best way is handle (ignore) it by the kernel itself, which can avoid such boot warnings (if we use something like init=/bin/bash, bootloader identifier can even cause a crash): Kernel command line: BOOT_IMAGE=(hd0,1)/vmlinuz-6.x root=/dev/sda3 ro console=tty Unknown kernel command line parameters "BOOT_IMAGE=(hd0,1)/vmlinuz-6.x", will be passed to user space. [chenhuacai@loongson.cn: use strstarts()] Link: https://lkml.kernel.org/r/20250815090120.1569947-1-chenhuacai@loongson.cn Link: https://lkml.kernel.org/r/20250721101343.3283480-1-chenhuacai@loongson.cn Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-03init/main.c: fix boot time tracing crashMike Rapoport (Microsoft)1-1/+1
Steven Rostedt reported a crash with "ftrace=function" kernel command line: [ 0.159269] BUG: kernel NULL pointer dereference, address: 000000000000001c [ 0.160254] #PF: supervisor read access in kernel mode [ 0.160975] #PF: error_code(0x0000) - not-present page [ 0.161697] PGD 0 P4D 0 [ 0.162055] Oops: Oops: 0000 [#1] SMP PTI [ 0.162619] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc2-test-00006-g48d06e78b7cb-dirty #9 PREEMPT(undef) [ 0.164141] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 0.165439] RIP: 0010:kmem_cache_alloc_noprof (mm/slub.c:4237) [ 0.166186] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 e4 f0 48 83 ec 20 8b 05 c9 b6 7e 01 <44> 8b 77 1c 65 4c 8b 2d b5 ea 20 02 4c 89 6c 24 18 41 89 f5 21 f0 [ 0.168811] RSP: 0000:ffffffffb2e03b30 EFLAGS: 00010086 [ 0.169545] RAX: 0000000001fff33f RBX: 0000000000000000 RCX: 0000000000000000 [ 0.170544] RDX: 0000000000002800 RSI: 0000000000002800 RDI: 0000000000000000 [ 0.171554] RBP: ffffffffb2e03b80 R08: 0000000000000004 R09: ffffffffb2e03c90 [ 0.172549] R10: ffffffffb2e03c90 R11: 0000000000000000 R12: 0000000000000000 [ 0.173544] R13: ffffffffb2e03c90 R14: ffffffffb2e03c90 R15: 0000000000000001 [ 0.174542] FS: 0000000000000000(0000) GS:ffff9d2808114000(0000) knlGS:0000000000000000 [ 0.175684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.176486] CR2: 000000000000001c CR3: 000000007264c001 CR4: 00000000000200b0 [ 0.177483] Call Trace: [ 0.177828] <TASK> [ 0.178123] mas_alloc_nodes (lib/maple_tree.c:176 (discriminator 2) lib/maple_tree.c:1255 (discriminator 2)) [ 0.178692] mas_store_gfp (lib/maple_tree.c:5468) [ 0.179223] execmem_cache_add_locked (mm/execmem.c:207) [ 0.179870] execmem_alloc (mm/execmem.c:213 mm/execmem.c:313 mm/execmem.c:335 mm/execmem.c:475) [ 0.180397] ? ftrace_caller (arch/x86/kernel/ftrace_64.S:169) [ 0.180922] ? __pfx_ftrace_caller (arch/x86/kernel/ftrace_64.S:158) [ 0.181517] execmem_alloc_rw (mm/execmem.c:487) [ 0.182052] arch_ftrace_update_trampoline (arch/x86/kernel/ftrace.c:266 arch/x86/kernel/ftrace.c:344 arch/x86/kernel/ftrace.c:474) [ 0.182778] ? ftrace_caller_op_ptr (arch/x86/kernel/ftrace_64.S:182) [ 0.183388] ftrace_update_trampoline (kernel/trace/ftrace.c:7947) [ 0.184024] __register_ftrace_function (kernel/trace/ftrace.c:368) [ 0.184682] ftrace_startup (kernel/trace/ftrace.c:3048) [ 0.185205] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210) [ 0.185877] register_ftrace_function_nolock (kernel/trace/ftrace.c:8717) [ 0.186595] register_ftrace_function (kernel/trace/ftrace.c:8745) [ 0.187254] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210) [ 0.187924] function_trace_init (kernel/trace/trace_functions.c:170) [ 0.188499] tracing_set_tracer (kernel/trace/trace.c:5916 kernel/trace/trace.c:6349) [ 0.189088] register_tracer (kernel/trace/trace.c:2391) [ 0.189642] early_trace_init (kernel/trace/trace.c:11075 kernel/trace/trace.c:11149) [ 0.190204] start_kernel (init/main.c:970) [ 0.190732] x86_64_start_reservations (arch/x86/kernel/head64.c:307) [ 0.191381] x86_64_start_kernel (??:?) [ 0.191955] common_startup_64 (arch/x86/kernel/head_64.S:419) [ 0.192534] </TASK> [ 0.192839] Modules linked in: [ 0.193267] CR2: 000000000000001c [ 0.193730] ---[ end trace 0000000000000000 ]--- The crash happens because on x86 ftrace allocations from execmem require maple tree to be initialized. Move maple tree initialization that depends only on slab availability earlier in boot so that it will happen right after mm_core_init(). Link: https://lkml.kernel.org/r/20250824130759.1732736-1-rppt@kernel.org Fixes: 5d79c2be5081 ("x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250820184743.0302a8b5@gandalf.local.home/ Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-03Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Significant patch series in this pull request: - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets us closer to being able to remove page->mapping - "relayfs: misc changes" (Jason Xing) does some maintenance and minor feature addition work in relayfs - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches us from static preallocation of the kdump crashkernel's working memory over to dynamic allocation. So the difficulty of a-priori estimation of the second kernel's needs is removed and the first kernel obtains extra memory - "generalize panic_print's dump function to be used by other kernel parts" (Feng Tang) implements some consolidation and rationalization of the various ways in which a failing kernel splats information at the operator * tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits) tools/getdelays: add backward compatibility for taskstats version kho: add test for kexec handover delaytop: enhance error logging and add PSI feature description samples: Kconfig: fix spelling mistake "instancess" -> "instances" fat: fix too many log in fat_chain_add() scripts/spelling.txt: add notifer||notifier to spelling.txt xen/xenbus: fix typo "notifer" net: mvneta: fix typo "notifer" drm/xe: fix typo "notifer" cxl: mce: fix typo "notifer" KVM: x86: fix typo "notifer" MAINTAINERS: add maintainers for delaytop ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below() ucount: fix atomic_long_inc_below() argument type kexec: enable CMA based contiguous allocation stackdepot: make max number of pools boot-time configurable lib/xxhash: remove unused functions init/Kconfig: restore CONFIG_BROKEN help text lib/raid6: update recov_rvv.c zero page usage docs: update docs after introducing delaytop ...
2025-07-19init/main.c: add warning when file specified in rdinit is inaccessibleLillian Berry1-1/+5
Avoid silently ignoring the initramfs when the file specified in rdinit is not usable. This prints an error that clearly explains the issue (file was not found, vs initramfs was not found). Link: https://lkml.kernel.org/r/20250707091411.1412681-1-lillian@star-ark.net Signed-off-by: Lillian Berry <lillian@star-ark.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-20x86/efi: Move runtime service initialization to arch/x86Alexander Shishkin1-5/+0
The EFI call in start_kernel() is guarded by #ifdef CONFIG_X86. Move the thing to the arch_cpu_finalize_init() path on x86 and get rid of the #ifdef in start_kernel(). No functional change intended. Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/all/20250620135325.3300848-5-kirill.shutemov%40linux.intel.com
2025-06-11init: fix build warnings about export.hHuacai Chen1-0/+1
After commit a934a57a42f64a4 ("scripts/misc-check: check missing #include <linux/export.h> when W=1") and 7d95680d64ac8e836c ("scripts/misc-check: check unnecessary #include <linux/export.h> when W=1"), we get some build warnings with W=1: init/main.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing init/initramfs.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing So fix these build warnings for the init code. Link: https://lkml.kernel.org/r/20250608141235.155206-1-chenhuacai@loongson.cn Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1") Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds1-2/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
2025-05-31Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-12mm/memcg: move mem_cgroup_init() ahead of cgroup_init()Huan Yang1-0/+2
Patch series "Use kmem_cache for memcg alloc", v3. (willy tldr: "you've gone from allocating 8 objects per 32KiB to allocating 13 objects per 32KiB, a 62% improvement in memory consumption" [1]) The mem_cgroup_alloc function creates mem_cgroup struct and it's associated structures including mem_cgroup_per_node. Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel, 1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure), we can observe the memory allocation for these structures using the following shell commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup # Trigger allocation if cgroup subtree do not enable memcg echo +memory > /sys/fs/cgroup/cgroup.subtree_control Ftrace Output: # mem_cgroup struct allocation sh-6312 [000] ..... 58015.698365: kmalloc: call_site=mem_cgroup_css_alloc+0xd8/0x5b4 ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-6312 [000] ..... 58015.698389: kmalloc: call_site=mem_cgroup_css_alloc+0x1d8/0x5b4 ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false Key Observations: 1. Both structures use kmalloc with requested sizes between 2KB-4KB 2. Allocation alignment forces 4KB slab usage due to pre-defined sizes (64B, 128B,..., 2KB, 4KB, 8KB) 3. Memory waste per memcg instance: Base struct: 4096 - 2312 = 1784 bytes Per-node struct: 4096 - 2896 = 1200 bytes Total waste: 2984 bytes (1-node system) NUMA scaling: (1200 + 8) * nr_node_ids bytes So, it's a little waste. This patchset introduces dedicated kmem_cache: Patch2 - mem_cgroup kmem_cache - memcg_cachep Patch3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep The benefits of this change can be observed with the following tracing commands: # Enable tracing echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable echo 1 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup # In another terminal: echo +memory > /sys/fs/cgroup/cgroup.subtree_control The output might now look like this: # mem_cgroup struct allocation sh-9827 [000] ..... 289.513598: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806 bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9827 [000] ..... 289.513602: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false This indicates that the `mem_cgroup` struct now requests 2312 bytes and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes and is allocated 2944 bytes. The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the `kmem_cache`. Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as: # mem_cgroup struct allocation sh-9269 [003] ..... 80.396366: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475 bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false # mem_cgroup_per_node allocation sh-9269 [003] ..... 80.396411: kmem_cache_alloc: call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6 bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial for performance. Please let me know if there are any issues or if I've misunderstood anything. This patchset also move mem_cgroup_init ahead of cgroup_init() due to cgroup_init() will allocate root_mem_cgroup, but each initcall invoke after cgroup_init, so if each kmem_cache do not prepare, we need testing NULL before use it. This patch (of 3): When cgroup_init() creates root_mem_cgroup through css_alloc callback, some critical resources might not be fully initialized, forcing later operations to perform conditional checks for resource availability. This patch move mem_cgroup_init() to address the init order, it invoke before cgroup_init, so, compare to subsys_initcall, it can use to prepare some key resources before root_mem_cgroup alloc. Link: https://lkml.kernel.org/r/aAsRCj-niMMTtmK8@casper.infradead.org [1] Link: https://lkml.kernel.org/r/20250425031935.76411-1-link@vivo.com Link: https://lkml.kernel.org/r/20250425031935.76411-2-link@vivo.com Signed-off-by: Huan Yang <link@vivo.com> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Francesco Valla <francesco@valla.it> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Huang Shijie <shijie@os.amperecomputing.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Raul E Rangel <rrangel@chromium.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11init/main.c: log initcall level when initcall_debug is usedFrancesco Valla1-2/+16
When initcall_debug is specified on the command line, the start and return point for each initcall is printed. However, no information on the initcall level is reported. Add to the initcall_debug infrastructure an additional print that informs when a new initcall level is entered. This is particularly useful when debugging dependency chains and/or working on boot time reduction. Link: https://lkml.kernel.org/r/20250316205014.2830071-2-francesco@valla.it Signed-off-by: Francesco Valla <francesco@valla.it> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Bird <tim.bird@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-08timers: Rename init_timers() as timers_init()Ingo Molnar1-1/+1
Move this API to the canonical timers_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-8-mingo@kernel.org
2025-03-04rcu-tasks: Move RCU Tasks self-tests to core_initcall()Paul E. McKenney1-1/+0
The timer and hrtimer softirq processing has moved to dedicated threads for kernels built with CONFIG_IRQ_FORCED_THREADING=y. This results in timers not expiring until later in early boot, which in turn causes the RCU Tasks self-tests to hang in kernels built with CONFIG_PROVE_RCU=y, which further causes the entire kernel to hang. One fix would be to make timers work during this time, but there are no known users of RCU Tasks grace periods during that time, so no justification for the added complexity. Not yet, anyway. This commit therefore moves the call to rcu_init_tasks_generic() from kernel_init_freeable() to a core_initcall(). This works because the timer and hrtimer kthreads are created at early_initcall() time. Fixes: 49a17639508c3 ("softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: <linux-trace-kernel@vger.kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds1-14/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-25mm/memblock: add memblock_alloc_or_panic interfaceGuo Weikang1-14/+4
Before SLUB initialization, various subsystems used memblock_alloc to allocate memory. In most cases, when memory allocation fails, an immediate panic is required. To simplify this behavior and reduce repetitive checks, introduce `memblock_alloc_or_panic`. This function ensures that memory allocation failures result in a panic automatically, improving code readability and consistency across subsystems that require this behavior. [guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic] Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/ Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-11rcu/kvfree: Initialize kvfree_rcu() separatelyUladzislau Rezki (Sony)1-0/+1
Introduce a separate initialization of kvfree_rcu() functionality. For such purpose a kfree_rcu_batch_init() is renamed to a kvfree_rcu_init() and it is invoked from the main.c right after rcu_init() is done. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-01init: Don't proxy `console=` to earlyconRaul E Rangel1-4/+1
Today we are proxying the `console=` command line args to the `param_setup_earlycon()` handler. This is done because the following are equivalent: console=uart[8250],mmio,<addr>[,options] earlycon=uart[8250],mmio,<addr>[,options] Both invocations enable an early `bootconsole`. `console=uartXXXX` is just an alias for `earlycon=uartXXXX`. In addition, when `earlycon=` (empty value) or just `earlycon` (no value) is specified on the command line, we enable the earlycon `bootconsole` specified by the SPCR table or the DT. The problem arises when `console=` (empty value) is specified on the command line. It's intention is to disable the `console`, but what happens instead is that the SPRC/DT console gets enabled. This happens because we are proxying the `console=` (empty value) parameter to the `earlycon` handler. The `earlycon` handler then sees that the parameter value is empty, so it enables the SPCR/DT `bootconsole`. This change makes it so that the `console` or `console=` parameters no longer enable the SPCR/DT `bootconsole`. I also cleans up the hack in `main.c` that would forward the `console` parameter to the `earlycon` handler. Signed-off-by: Raul E Rangel <rrangel@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240911123507.v2.1.Id08823b2f848237ae90ce5c5fa7e027e97c33ad3@changeid Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-20init/main.c: Initialize early LSMs after arch code, static keys and calls.KP Singh1-2/+4
With LSMs using static calls and static keys, early_lsm_init needs to wait for setup_arch for architecture specific functionality which includes jump tables and static calls to be initialized. Since not all architectures call jump_table_init in setup_arch, explicitly call both jump_table_init and static_call_init before early_security_init. This only affects "early LSMs" i.e. only lockdown when CONFIG_SECURITY_LOCKDOWN_LSM_EARLY is set. Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-05-19Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds1-0/+19
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
2024-05-17Merge tag 'bootconfig-v6.10' of ↵Linus Torvalds1-7/+10
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull bootconfig updates from Masami Hiramatsu: - Do not put unneeded quotes on the extra command line items which was inserted from the bootconfig. - Remove redundant spaces from the extra command line. * tag 'bootconfig-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: init/main.c: Minor cleanup for the setup_command_line() function init/main.c: Remove redundant space from saved_command_line bootconfig: do not put quotes on cmdline items unless necessary
2024-05-13Merge tag 'locking-core-2024-05-13' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Over a dozen code generation micro-optimizations for the atomic and spinlock code - Add more __ro_after_init attributes - Robustify the lockdevent_*() macros * tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro locking/qspinlock/x86: Micro-optimize virt_spin_lock() locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu() locking/atomic/x86: Introduce arch_try_cmpxchg64_local() locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock() locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending() locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail() locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32 locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32 locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64 locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}() locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128() x86/tsc: Make __use_tsc __ro_after_init x86/kvm: Make kvm_async_pf_enabled __ro_after_init context_tracking: Make context_tracking_key __ro_after_init jump_label,module: Don't alloc static_key_mod for __ro_after_init keys locking/qspinlock: Always evaluate lockevent* non-event parameter once
2024-04-25NUMA: early use of cpu_to_node() returns 0 instead of the correct node idHuang Shijie1-0/+14
During the kernel booting, the generic cpu_to_node() is called too early in arm64, powerpc and riscv when CONFIG_NUMA is enabled. There are at least four places in the common code where the generic cpu_to_node() is called before it is initialized: 1.) early_trace_init() in kernel/trace/trace.c 2.) sched_init() in kernel/sched/core.c 3.) init_sched_fair_class() in kernel/sched/fair.c 4.) workqueue_init_early() in kernel/workqueue.c This will harm performance since there is an increase in off node accesses. In order to fix the bug, the patch introduces early_numa_node_init() which is called after smp_prepare_boot_cpu() in start_kernel. early_numa_node_init will initialize the "numa_node" as soon as the early_cpu_to_node() is ready, before the cpu_to_node() is called at the first time. Link: https://lkml.kernel.org/r/20240126064451.5465-1-shijie@os.amperecomputing.com Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [RISC-V] Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michael Kelley (LINUX) <mikelley@microsoft.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25bootconfig: do not put quotes on cmdline items unless necessaryRasmus Villemoes1-3/+9
When trying to migrate to using bootconfig to embed the kernel's and PID1's command line with the kernel image itself, and so allowing changing that without modifying the bootloader, I noticed that /proc/cmdline changed from e.g. console=ttymxc0,115200n8 cma=128M quiet -- --log-level=notice to console="ttymxc0,115200n8" cma="128M" quiet -- --log-level="notice" The kernel parameters are parsed just fine, and the quotes are indeed stripped from the actual argv[] given to PID1. However, the quoting doesn't really serve any purpose and looks excessive, and might confuse some (naive) userspace tool trying to parse /proc/cmdline. So do not quote the value unless it contains whitespace. Link: https://lkml.kernel.org/r/20240320101952.62135-1-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-13init/main.c: Minor cleanup for the setup_command_line() functionYuntao Wang1-3/+3
This is just a minor cleanup to make the code look a bit cleaner. Link: https://lore.kernel.org/all/20240412081733.35925-3-ytcoode@gmail.com/ Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-13init/main.c: Remove redundant space from saved_command_lineYuntao Wang1-1/+3
There is a space at the end of extra_init_args. In the current logic, copying extra_init_args to saved_command_line will cause extra spaces in saved_command_line here or there. Remove the trailing space from extra_init_args to make the string in saved_command_line look more perfect. Link: https://lore.kernel.org/all/20240412032950.12687-1-ytcoode@gmail.com/ Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-13bootconfig: do not put quotes on cmdline items unless necessaryRasmus Villemoes1-3/+4
When trying to migrate to using bootconfig to embed the kernel's and PID1's command line with the kernel image itself, and so allowing changing that without modifying the bootloader, I noticed that /proc/cmdline changed from e.g. console=ttymxc0,115200n8 cma=128M quiet -- --log-level=notice to console="ttymxc0,115200n8" cma="128M" quiet -- --log-level="notice" The kernel parameters are parsed just fine, and the quotes are indeed stripped from the actual argv[] given to PID1. However, the quoting doesn't really serve any purpose and looks excessive, and might confuse some (naive) userspace tool trying to parse /proc/cmdline. So do not quote the value unless it contains whitespace. Link: https://lore.kernel.org/all/20240308124401.1702046-1-linux@rasmusvillemoes.dk/ Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-13init/main.c: Fix potential static_command_line memory overflowYuntao Wang1-0/+2
We allocate memory of size 'xlen + strlen(boot_command_line) + 1' for static_command_line, but the strings copied into static_command_line are extra_command_line and command_line, rather than extra_command_line and boot_command_line. When strlen(command_line) > strlen(boot_command_line), static_command_line will overflow. This patch just recovers strlen(command_line) which was miss-consolidated with strlen(boot_command_line) in the commit f5c7310ac73e ("init/main: add checks for the return value of memblock_alloc*()") Link: https://lore.kernel.org/all/20240412081733.35925-2-ytcoode@gmail.com/ Fixes: f5c7310ac73e ("init/main: add checks for the return value of memblock_alloc*()") Cc: stable@vger.kernel.org Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-09fs/proc: Skip bootloader comment if no embedded kernel parametersMasami Hiramatsu1-0/+5
If the "bootconfig" kernel command-line argument was specified or if the kernel was built with CONFIG_BOOT_CONFIG_FORCE, but if there are no embedded kernel parameter, omit the "# Parameters from bootloader:" comment from the /proc/bootconfig file. This will cause automation to fall back to the /proc/cmdline file, which will be identical to the comment in this no-embedded-kernel-parameters case. Link: https://lore.kernel.org/all/20240409044358.1156477-2-paulmck@kernel.org/ Fixes: 8b8ce6c75430 ("fs/proc: remove redundant comments from /proc/bootconfig") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-22jump_label,module: Don't alloc static_key_mod for __ro_after_init keysPeter Zijlstra1-0/+1
When a static_key is marked ro_after_init, its state will never change (after init), therefore jump_label_update() will never need to iterate the entries, and thus module load won't actually need to track this -- avoiding the static_key::next write. Therefore, mark these keys such that jump_label_add_module() might recognise them and avoid the modification. Use the special state: 'static_key_linked(key) && !static_key_mod(key)' to denote such keys. jump_label_add_module() does not exist under CONFIG_JUMP_LABEL=n, so the newly-introduced jump_label_init_ro() can be defined as a nop for that configuration. [ mingo: Renamed jump_label_ro() to jump_label_init_ro() ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240313180106.2917308-2-vschneid@redhat.com
2024-03-14Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds1-7/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
2024-03-14Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds1-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-13Merge tag 'modules-6.9-rc1' of ↵Linus Torvalds1-14/+7
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Christophe Leroy did most of the work on this release, first with a few cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error handling for when set_memory_XX() can fail. This is part of a larger effort to clean up all these callers which can fail, modules is just part of it" * tag 'modules-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: module: Don't ignore errors from set_memory_XX() lib/test_kmod: fix kernel-doc warnings powerpc: Simplify strict_kernel_rwx_enabled() modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled init: Declare rodata_enabled and mark_rodata_ro() at all time module: Change module_enable_{nx/x/ro}() to more explicit names module: Use set_memory_rox()
2024-03-11Merge tag 'x86-cleanups-2024-03-11' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups, including a large series from Thomas Gleixner to cure sparse warnings" * tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi: Drop unused declaration of proc_nmi_enabled() x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address() x86/percpu: Cure per CPU madness on UP smp: Consolidate smp_prepare_boot_cpu() x86/msr: Add missing __percpu annotations x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h> perf/x86/amd/uncore: Fix __percpu annotation x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP) x86/apm_32: Remove dead function apm_get_battery_status() x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11Merge tag 'x86-apic-2024-03-10' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: "Rework of APIC enumeration and topology evaluation. The current implementation has a couple of shortcomings: - It fails to handle hybrid systems correctly. - The APIC registration code which handles CPU number assignents is in the middle of the APIC code and detached from the topology evaluation. - The various mechanisms which enumerate APICs, ACPI, MPPARSE and guest specific ones, tweak global variables as they see fit or in case of XENPV just hack around the generic mechanisms completely. - The CPUID topology evaluation code is sprinkled all over the vendor code and reevaluates global variables on every hotplug operation. - There is no way to analyze topology on the boot CPU before bringing up the APs. This causes problems for infrastructure like PERF which needs to size certain aspects upfront or could be simplified if that would be possible. - The APIC admission and CPU number association logic is incomprehensible and overly complex and needs to be kept around after boot instead of completing this right after the APIC enumeration. This update addresses these shortcomings with the following changes: - Rework the CPUID evaluation code so it is common for all vendors and provides information about the APIC ID segments in a uniform way independent of the number of segments (Thread, Core, Module, ..., Die, Package) so that this information can be computed instead of rewriting global variables of dubious value over and over. - A few cleanups and simplifcations of the APIC, IO/APIC and related interfaces to prepare for the topology evaluation changes. - Seperation of the parser stages so the early evaluation which tries to find the APIC address can be seperately overridden from the late evaluation which enumerates and registers the local APIC as further preparation for sanitizing the topology evaluation. - A new registration and admission logic which - encapsulates the inner workings so that parsers and guest logic cannot longer fiddle in it - uses the APIC ID segments to build topology bitmaps at registration time - provides a sane admission logic - allows to detect the crash kernel case, where CPU0 does not run on the real BSP, automatically. This is required to prevent sending INIT/SIPI sequences to the real BSP which would reset the whole machine. This was so far handled by a tedious command line parameter, which does not even work in nested crash scenarios. - Associates CPU number after the enumeration completed and prevents the late registration of APICs, which was somehow tolerated before. - Converting all parsers and guest enumeration mechanisms over to the new interfaces. This allows to get rid of all global variable tweaking from the parsers and enumeration mechanisms and sanitizes the XEN[PV] handling so it can use CPUID evaluation for the first time. - Mopping up existing sins by taking the information from the APIC ID segment bitmaps. This evaluates hybrid systems correctly on the boot CPU and allows for cleanups and fixes in the related drivers, e.g. PERF. The series has been extensively tested and the minimal late fallout due to a broken ACPI/MADT table has been addressed by tightening the admission logic further" * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits) x86/topology: Ignore non-present APIC IDs in a present package x86/apic: Build the x86 topology enumeration functions on UP APIC builds too smp: Provide 'setup_max_cpus' definition on UP too smp: Avoid 'setup_max_cpus' namespace collision/shadowing x86/bugs: Use fixed addressing for VERW operand x86/cpu/topology: Get rid of cpuinfo::x86_max_cores x86/cpu/topology: Provide __num_[cores|threads]_per_package x86/cpu/topology: Rename topology_max_die_per_package() x86/cpu/topology: Rename smp_num_siblings x86/cpu/topology: Retrieve cores per package from topology bitmaps x86/cpu/topology: Use topology logical mapping mechanism x86/cpu/topology: Provide logical pkg/die mapping x86/cpu/topology: Simplify cpu_mark_primary_thread() x86/cpu/topology: Mop up primary thread mask handling x86/cpu/topology: Use topology bitmaps for sizing x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT x86/xen/smp_pv: Count number of vCPUs early x86/cpu/topology: Assign hotpluggable CPUIDs during init x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug x86/topology: Add a mechanism to track topology via APIC IDs ...
2024-03-11Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-0/+1
Pull workqueue updates from Tejun Heo: "This cycle, a lot of workqueue changes including some that are significant and invasive. - During v6.6 cycle, unbound workqueues were updated so that they are more topology aware and flexible, which among other things improved workqueue behavior on modern multi-L3 CPUs. In the process, commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") switched unbound workqueues to use per-CPU frontend pool_workqueues as a part of increasing front-back mapping flexibility. An unwelcome side effect of this change was that this made max concurrency enforcement per-CPU blowing up the maximum number of allowed concurrent executions. I incorrectly assumed that this wouldn't cause practical problems as most unbound workqueue users are self-regulate max concurrency; however, there definitely are which don't (e.g. on IO paths) and the drastic increase in the allowed max concurrency led to noticeable perf regressions in some use cases. This is now addressed by separating out max concurrency enforcement to a separate struct - wq_node_nr_active - which makes @max_active consistently mean system-wide max concurrency regardless of the number of CPUs or (finally) NUMA nodes. This is a rather invasive and, in places, a bit clunky; however, the clunkiness rises from the the inherent requirement to handle the disagreement between the execution locality domain and max concurrency enforcement domain on some modern machines. See commit 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") for more details. - BH workqueue support is added. They are similar to per-CPU workqueues but execute work items in the softirq context. This is expected to replace tasklet. However, currently, it's missing the ability to disable and enable work items which is needed to convert many tasklet users. To avoid crowding this merge window too much, this will be included in the next merge window. A separate pull request will be sent for the couple conversion patches that are currently pending. - Waiman plugged a long-standing hole in workqueue CPU isolation where ordered workqueues didn't follow wq_unbound_cpumask updates. Ordered workqueues now follow the same rules as other unbound workqueues. - More CPU isolation improvements: Juri fixed another deficit in workqueue isolation where unbound rescuers don't respect wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on isolated CPUs. - Other misc changes" * tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits) workqueue: Drain BH work items on hot-unplugged CPUs workqueue: Introduce from_work() helper for cleaner callback declarations workqueue: Control intensive warning threshold through cmdline workqueue: Make @flags handling consistent across set_work_data() and friends workqueue: Remove clear_work_data() workqueue: Factor out work_grab_pending() from __cancel_work_sync() workqueue: Clean up enum work_bits and related constants workqueue: Introduce work_cancel_flags workqueue: Use variable name irq_flags for saving local irq flags workqueue: Reorganize flush and cancel[_sync] functions workqueue: Rename __cancel_work_timer() to __cancel_timer_sync() workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held() workqueue: Cosmetic changes workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK workqueue: Fix queue_work_on() with BH workqueues async: Use a dedicated unbound workqueue with raised min_active workqueue: Implement workqueue_set_min_active() workqueue: Fix kernel-doc comment of unplug_oldest_pwq() workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask kernel/workqueue: Let rescuers follow unbound wq cpumask changes ...
2024-03-04modules: wait do_free_init correctlyChangbin Du1-2/+3
The synchronization here is to ensure the ordering of freeing of a module init so that it happens before W+X checking. It is worth noting it is not that the freeing was not happening, it is just that our sanity checkers raced against the permission checkers which assume init memory is already gone. Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling do_free_init() into a global workqueue instead of relying on it being called through call_rcu(..., do_free_init), which used to allowed us call do_free_init() asynchronously after the end of a subsequent grace period. The move to a global workqueue broke the gaurantees for code which needed to be sure the do_free_init() would complete with rcu_barrier(). To fix this callers which used to rely on rcu_barrier() must now instead use flush_work(&init_free_wq). Without this fix, we still could encounter false positive reports in W+X checking since the rcu_barrier() here can not ensure the ordering now. Even worse, the rcu_barrier() can introduce significant delay. Eric Chanudet reported that the rcu_barrier introduces ~0.1s delay on a PREEMPT_RT kernel. [ 0.291444] Freeing unused kernel memory: 5568K [ 0.402442] Run /sbin/init as init process With this fix, the above delay can be eliminated. Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag") Signed-off-by: Changbin Du <changbin.du@huawei.com> Tested-by: Eric Chanudet <echanude@redhat.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Xiaoyi Su <suxiaoyi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04smp: Consolidate smp_prepare_boot_cpu()Thomas Gleixner1-0/+4
There is no point in having seven architectures implementing the same empty stub. Provide a weak function in the init code and remove the stubs. This also allows to utilize the function on UP which is required to sanitize the per CPU handling on X86 UP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.567671691@linutronix.de
2024-03-01pidfd: add pidfsChristian Brauner1-0/+2
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows: * statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes. The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed. We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted. The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-27smp: Provide 'setup_max_cpus' definition on UP tooIngo Molnar1-1/+0
This was already defined locally by init/main.c, but let's make it generic, as arch/x86/kernel/cpu/topology.c is going to make use of it to have more uniform code. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-22init: remove obsolete arch_call_rest_init() wrapperGeert Uytterhoeven1-7/+2
Since commit 3570ee046c46b5dc ("s390/smp: keep the original lowcore for CPU 0"), there is no longer any architecture that needs to override arch_call_rest_init(). Remove the weak wrapper around rest_init(), call rest_init() directly, and make rest_init() static. Link: https://lkml.kernel.org/r/aa10868bfb176eef4abb8bb4a710b85330792694.1706106183.git.geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WXChristophe Leroy1-0/+2
All architectures using the core ptdump functionality also implement CONFIG_DEBUG_WX, and they all do it more or less the same way, with a function called debug_checkwx() that is called by mark_rodata_ro(), which is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a no-op otherwise. Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call debug_checkwx() immediately after calling mark_rodata_ro() instead of calling it at the end of every mark_rodata_ro(). On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX before calling debug_checkwx(). Now the check is inside the callee ptdump_walk_pgd_level_checkwx(). On powerpc_64, mark_rodata_ro() bails out early before calling ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check is now also done in ptdump_check_wx() as it is called outside mark_rodata_ro(). Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-09async: Use a dedicated unbound workqueue with raised min_activeTejun Heo1-0/+1
Async can schedule a number of interdependent work items. However, since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This default value is 8 which isn't sufficient for async and can lead to stalls during resume from suspend in some cases. Let's use a dedicated unbound workqueue with raised min_active. Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-02init: Declare rodata_enabled and mark_rodata_ro() at all timeChristophe Leroy1-14/+7
Declaring rodata_enabled and mark_rodata_ro() at all time helps removing related #ifdefery in C files. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-11-01proc: sysctl: prevent aliased sysctls from getting passed to initKrister Johansen1-0/+4
The code that checks for unknown boot options is unaware of the sysctl alias facility, which maps bootparams to sysctl values. If a user sets an old value that has a valid alias, a message about an invalid parameter will be printed during boot, and the parameter will get passed to init. Fix by checking for the existence of aliased parameters in the unknown boot parameter code. If an alias exists, don't return an error or pass the value to init. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Cc: stable@vger.kernel.org Fixes: 0a477e1ae21b ("kernel/sysctl: support handling command line aliases") Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-07workqueue: Initialize unbound CPU pods later in the bootTejun Heo1-0/+1
During boot, to initialize unbound CPU pods, wq_pod_init() was called from workqueue_init(). This is early enough for NUMA nodes to be set up but before SMP is brought up and CPU topology information is populated. Workqueue is in the process of improving CPU locality for unbound workqueues and will need access to topology information during pod init. This adds a new init function workqueue_init_topology() which is called after CPU topology information is available and replaces wq_pod_init(). As unbound CPU pods are now initialized after workqueues are activated, we need to revisit the workqueues to apply the pod configuration. Workqueues which are created before workqueue_init_topology() are set up so that they always use the default worker pool. After pods are set up in workqueue_init_topology(), wq_update_pod() is called on all existing workqueues to update the pool associations accordingly. Note that wq_update_pod_attrs_buf allocation is moved to workqueue_init_early(). This isn't necessary right now but enables further generalization of pod handling in the future. This patch changes the initialization sequence but the end result should be the same. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-28Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of ↵Linus Torvalds1-18/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level directories - Douglas Anderson has added a new "buddy" mode to the hardlockup detector. It permits the detector to work on architectures which cannot provide the required interrupts, by having CPUs periodically perform checks on other CPUs - Zhen Lei has enhanced kexec's ability to support two crash regions - Petr Mladek has done a lot of cleanup on the hard lockup detector's Kconfig entries - And the usual bunch of singleton patches in various places * tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) kernel/time/posix-stubs.c: remove duplicated include ocfs2: remove redundant assignment to variable bit_off watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h devres: show which resource was invalid in __devm_ioremap_resource() watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64 watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h watchdog/hardlockup: make the config checks more straightforward watchdog/hardlockup: sort hardlockup detector related config values a logical way watchdog/hardlockup: move SMP barriers from common code to buddy code watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY watchdog/buddy: don't copy the cpumask in watchdog_next_cpu() watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog() watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy() watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick() watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe() watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails ...
2023-06-27Merge tag 'objtool-core-2023-06-27' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molar: "Build footprint & performance improvements: - Reduce memory usage with CONFIG_DEBUG_INFO=y In the worst case of an allyesconfig+CONFIG_DEBUG_INFO=y kernel, DWARF creates almost 200 million relocations, ballooning objtool's peak heap usage to 53GB. These patches reduce that to 25GB. On a distro-type kernel with kernel IBT enabled, they reduce objtool's peak heap usage from 4.2GB to 2.8GB. These changes also improve the runtime significantly. Debuggability improvements: - Add the unwind_debug command-line option, for more extend unwinding debugging output - Limit unreachable warnings to once per function - Add verbose option for disassembling affected functions - Include backtrace in verbose mode - Detect missing __noreturn annotations - Ignore exc_double_fault() __noreturn warnings - Remove superfluous global_noreturns entries - Move noreturn function list to separate file - Add __kunit_abort() to noreturns Unwinder improvements: - Allow stack operations in UNWIND_HINT_UNDEFINED regions - drm/vmwgfx: Add unwind hints around RBP clobber Cleanups: - Move the x86 entry thunk restore code into thunk functions - x86/unwind/orc: Use swap() instead of open coding it - Remove unnecessary/unused variables Fixes for modern stack canary handling" * tag 'objtool-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) x86/orc: Make the is_callthunk() definition depend on CONFIG_BPF_JIT=y objtool: Skip reading DWARF section data objtool: Free insns when done objtool: Get rid of reloc->rel[a] objtool: Shrink elf hash nodes objtool: Shrink reloc->sym_reloc_entry objtool: Get rid of reloc->jump_table_start objtool: Get rid of reloc->addend objtool: Get rid of reloc->type objtool: Get rid of reloc->offset objtool: Get rid of reloc->idx objtool: Get rid of reloc->list objtool: Allocate relocs in advance for new rela sections objtool: Add for_each_reloc() objtool: Don't free memory in elf_close() objtool: Keep GElf_Rel[a] structs synced objtool: Add elf_create_section_pair() objtool: Add mark_sec_changed() objtool: Fix reloc_hash size objtool: Consolidate rel/rela handling ...
2023-06-16init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()Thomas Gleixner1-13/+0
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and remove the weak fallback from the core code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230613224545.670360645@linutronix.de
2023-06-16init: Invoke arch_cpu_finalize_init() earlierThomas Gleixner1-2/+2
X86 is reworking the boot process so that initializations which are not required during early boot can be moved into the late boot process and out of the fragile and restricted initial boot phase. arch_cpu_finalize_init() is the obvious place to do such initializations, but arch_cpu_finalize_init() is invoked too late in start_kernel() e.g. for initializing the FPU completely. fork_init() requires that the FPU is initialized as the size of task_struct on X86 depends on the size of the required FPU register buffer. Fortunately none of the init calls between calibrate_delay() and arch_cpu_finalize_init() is relevant for the functionality of arch_cpu_finalize_init(). Invoke it right after calibrate_delay() where everything which is relevant for arch_cpu_finalize_init() has been set up already. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20230613224545.612182854@linutronix.de
2023-06-16init: Remove check_bugs() leftoversThomas Gleixner1-5/+0
Everything is converted over to arch_cpu_finalize_init(). Remove the check_bugs() leftovers including the empty stubs in asm-generic, alpha, parisc, powerpc and xtensa. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Link: https://lore.kernel.org/r/20230613224545.553215951@linutronix.de
2023-06-16init: Provide arch_cpu_finalize_init()Thomas Gleixner1-0/+4
check_bugs() has become a dumping ground for all sorts of activities to finalize the CPU initialization before running the rest of the init code. Most are empty, a few do actual bug checks, some do alternative patching and some cobble a CPU advertisement string together.... Aside of that the current implementation requires duplicated function declaration and mostly empty header files for them. Provide a new function arch_cpu_finalize_init(). Provide a generic declaration if CONFIG_ARCH_HAS_CPU_FINALIZE_INIT is selected and a stub inline otherwise. This requires a temporary #ifdef in start_kernel() which will be removed along with check_bugs() once the architectures are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230613224544.957805717@linutronix.de
2023-06-09init: consolidate prototypes in linux/init.hArnd Bergmann1-18/+0
The init/main.c file contains some extern declarations for functions defined in architecture code, and it defines some other functions that are called from architecture code with a custom prototype. Both of those result in warnings with 'make W=1': init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes] init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes] arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes] arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes] arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes] init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] Add prototypes for all of these in include/linux/init.h or another appropriate header, and remove the duplicate declarations from architecture specific code. [sfr@canb.auug.org.au: declare time_init_early()] Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-16start_kernel: Omit prevent_tail_call_optimization() for newer toolchainsndesaulniers@google.com1-0/+6
prevent_tail_call_optimization() was added in commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try") to work around stack canaries getting inserted into functions that would initialize the stack canary in the first place. Now that we have no_stack_protector function attribute (gcc-11+, clang-7+) and use it on start_kernel(), remove the call to prevent_tail_call_optimization() such that we may one day remove it outright. Reviewed-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230412-no_stackp-v2-2-116f9fe4bbe7@google.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-05-16start_kernel: Add __no_stack_protector function attributendesaulniers@google.com1-1/+2
Back during the discussion of commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try") we discussed the need for a function attribute to control the omission of stack protectors on a per-function basis; at the time Clang had support for no_stack_protector but GCC did not. This was fixed in gcc-11. Now that the function attribute is available, let's start using it. Callers of boot_init_stack_canary need to use this function attribute unless they're compiled with -fno-stack-protector, otherwise the canary stored in the stack slot of the caller will differ upon the call to boot_init_stack_canary. This will lead to a call to __stack_chk_fail() then panic. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722 Link: https://lore.kernel.org/all/20200316130414.GC12561@hirez.programming.kicks-ass.net/ Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230412-no_stackp-v2-1-116f9fe4bbe7@google.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: ndesaulniers@google.com <ndesaulniers@google.com>
2023-04-28Merge tag 'objtool-core-2023-04-27' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did this inconsistently follow this new, common convention, and fix all the fallout that objtool can now detect statically - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code - Generate ORC data for __pfx code - Add more __noreturn annotations to various kernel startup/shutdown and panic functions - Misc improvements & fixes * tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/hyperv: Mark hv_ghcb_terminate() as noreturn scsi: message: fusion: Mark mpt_halt_firmware() __noreturn x86/cpu: Mark {hlt,resume}_play_dead() __noreturn btrfs: Mark btrfs_assertfail() __noreturn objtool: Include weak functions in global_noreturns check cpu: Mark nmi_panic_self_stop() __noreturn cpu: Mark panic_smp_self_stop() __noreturn arm64/cpu: Mark cpu_park_loop() and friends __noreturn x86/head: Mark *_start_kernel() __noreturn init: Mark start_kernel() __noreturn init: Mark [arch_call_]rest_init() __noreturn objtool: Generate ORC data for __pfx code x86/linkage: Fix padding for typed functions objtool: Separate prefix code from stack validation code objtool: Remove superfluous dead_end_function() check objtool: Add symbol iteration helpers objtool: Add WARN_INSN() scripts/objdump-func: Support multiple functions context_tracking: Fix KCSAN noinstr violation objtool: Add stackleak instrumentation to uaccess safe list ...
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-72/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-25Merge tag 'x86_sev_for_v6.4_rc1' of ↵Linus Torvalds1-8/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the necessary glue so that the kernel can run as a confidential SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the address space in two parts: encrypted and unencrypted. The use case being running unmodified guests on the Hyper-V confidential computing hypervisor - Double-buffer messages between the guest and the hardware PSP device so that no partial buffers are copied back'n'forth and thus potential message integrity and leak attacks are possible - Name the return value the sev-guest driver returns when the hw PSP device hasn't been called, explicitly - Cleanups * tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hyperv: Change vTOM handling to use standard coco mechanisms init: Call mem_encrypt_init() after Hyper-V hypercall init is done x86/mm: Handle decryption/re-encryption of bss_decrypted consistently Drivers: hv: Explicitly request decrypted in vmap_pfn() calls x86/hyperv: Reorder code to facilitate future work x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM x86/sev: Change snp_guest_issue_request()'s fw_err argument virt/coco/sev-guest: Double-buffer messages crypto: ccp: Get rid of __sev_platform_init_locked()'s local function pointer crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALL
2023-04-24Merge tag 'v6.4/kernel.user_worker' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull user work thread updates from Christian Brauner: "This contains the work generalizing the ability to create a kernel worker from a userspace process. Such user workers will run with the same credentials as the userspace process they were created from providing stronger security and accounting guarantees than the traditional override_creds() approach ever could've hoped for. The original work was heavily based and optimzed for the needs of io_uring which was the first user. However, as it quickly turned out the ability to create user workers inherting properties from a userspace process is generally useful. The vhost subsystem currently creates workers using the kthread api. The consequences of using the kthread api are that RLIMITs don't work correctly as they are inherited from khtreadd. This leads to bugs where more workers are created than would be allowed by the RLIMITs of the userspace process in lieu of which workers are created. Problems like this disappear with user workers created from the userspace processes for which they perform the work. In addition, providing this api allows vhost to remove additional complexity. For example, cgroup and mm sharing will just work out of the box with user workers based on the relevant userspace process instead of manually ensuring the correct cgroup and mm contexts are used. So the vhost subsystem should simply be made to use the same mechanism as io_uring. To this end the original mechanism used for create_io_thread() is generalized into user workers: - Introduce PF_USER_WORKER as a generic indicator that a given task is a user worker, i.e., a kernel task that was created from a userspace process. Now a PF_IO_WORKER thread is just a specialized version of PF_USER_WORKER. So io_uring io workers raise both flags. - Make copy_process() available to core kernel code - Extend struct kernel_clone_args with the following bitfields allowing to indicate to copy_process(): - to create a user worker (raise PF_USER_WORKER) - to not inherit any files from the userspace process - to ignore signals After all generic changes are in place the vhost subsystem implements a new dedicated vhost api based on user workers. Finally, vhost is switched to rely on the new api moving it off of kthreads. Thanks to Mike for sticking it out and making it through this rather arduous journey" * tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: vhost: use vhost_tasks for worker threads vhost: move worker thread fields to new struct vhost_task: Allow vhost layer to use copy_process fork: allow kernel code to call copy_process fork: Add kernel_clone_args flag to ignore signals fork: add kernel_clone_args flag to not dup/clone files fork/vm: Move common PF_IO_WORKER behavior to new flag kernel: Make io_thread and kthread bit fields kthread: Pass in the thread's name during creation kernel: Allow a kernel thread's name to be set in copy_process csky: Remove kernel_thread declaration
2023-04-14init: Mark start_kernel() __noreturnJosh Poimboeuf1-1/+1
Now that arch_call_rest_init() is __noreturn, mark its caller start_kernel() __noreturn. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/7069acf026a195f26a88061227fba5a3b0337b9a.1681342859.git.jpoimboe@kernel.org
2023-04-14init: Mark [arch_call_]rest_init() __noreturnJosh Poimboeuf1-2/+2
In preparation for improving objtool's handling of weak noreturn functions, mark start_kernel(), arch_call_rest_init(), and rest_init() __noreturn. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/7194ed8a989a85b98d92e62df660f4a90435a723.1681342859.git.jpoimboe@kernel.org
2023-04-05init,mm: fold late call to page_ext_init() to page_alloc_init_late()Mike Rapoport (IBM)1-4/+0
When deferred initialization of struct pages is enabled, page_ext_init() must be called after all the deferred initialization is done, but there is no point to keep it a separate call from kernel_init_freeable() right after page_alloc_init_late(). Fold the call to page_ext_init() into page_alloc_init_late() and localize deferred_struct_pages variable. Link: https://lkml.kernel.org/r/20230321170513.2401534-11-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05init,mm: move mm_init() to mm/mm_init.c and rename it to mm_core_init()Mike Rapoport (IBM)1-69/+2
Make mm_init() a part of mm/ codebase. mm_core_init() better describes what the function does and does not clash with mm_init() in kernel/fork.c Link: https://lkml.kernel.org/r/20230321170513.2401534-8-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05init: fold build_all_zonelists() and page_alloc_init_cpuhp() to mm_init()Mike Rapoport (IBM)1-3/+4
Both build_all_zonelists() and page_alloc_init_cpuhp() must be called after SMP setup is complete but before the page allocator is set up. Still, they both are a part of memory management initialization, so move them to mm_init(). Link: https://lkml.kernel.org/r/20230321170513.2401534-7-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm/page_alloc: rename page_alloc_init() to page_alloc_init_cpuhp()Mike Rapoport (IBM)1-1/+1
The page_alloc_init() name is really misleading because all this function does is sets up CPU hotplug callbacks for the page allocator. Rename it to page_alloc_init_cpuhp() so that name will reflect what the function does. Link: https://lkml.kernel.org/r/20230321170513.2401534-6-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-27init: Call mem_encrypt_init() after Hyper-V hypercall init is doneMichael Kelley1-8/+11
Full Hyper-V initialization, including support for hypercalls, is done as an apic_post_init callback via late_time_init(). mem_encrypt_init() needs to make hypercalls when it marks swiotlb memory as decrypted. But mem_encrypt_init() is currently called a few lines before late_time_init(), so the hypercalls don't work. Fix this by moving mem_encrypt_init() after late_time_init() and related clock initializations. The intervening initializations don't do any I/O that requires the swiotlb, so moving mem_encrypt_init() slightly later has no impact. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/1678329614-3482-6-git-send-email-mikelley@microsoft.com
2023-03-22bootconfig: Change message if no bootconfig with CONFIG_BOOT_CONFIG_FORCE=yMasami Hiramatsu (Google)1-3/+7
Change no bootconfig data error message if user do not specify 'bootconfig' option but CONFIG_BOOT_CONFIG_FORCE=y. With CONFIG_BOOT_CONFIG_FORCE=y, the kernel proceeds bootconfig check even if user does not specify 'bootconfig' option. So the current error message is confusing. Let's show just an information message to notice skipping the bootconfig in that case. Link: https://lore.kernel.org/all/167754610254.318944.16848412476667893329.stgit@devnote2/ Fixes: b743852ccc1d ("Allow forcing unconditional bootconfig processing") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/all/CAMuHMdV9jJvE2y8gY5V_CxidUikCf5515QMZHzTA3rRGEOj6=w@mail.gmail.com/ Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
2023-03-12kernel: Allow a kernel thread's name to be set in copy_processMike Christie1-1/+1
This patch allows kernel users to pass in the thread name so it can be set during creation instead of having to use set_task_comm after the thread is created. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-22Allow forcing unconditional bootconfig processingPaul E. McKenney1-1/+1
The BOOT_CONFIG family of Kconfig options allows a bootconfig file containing kernel boot parameters to be embedded into an initrd or into the kernel itself. This can be extremely useful when deploying kernels in cases where some of the boot parameters depend on the kernel version rather than on the server hardware, firmware, or workload. Unfortunately, the "bootconfig" kernel parameter must be specified in order to cause the kernel to look for the embedded bootconfig file, and it clearly does not help to embed this "bootconfig" kernel parameter into that file. Therefore, provide a new BOOT_CONFIG_FORCE Kconfig option that causes the kernel to act as if the "bootconfig" kernel parameter had been specified. In other words, kernels built with CONFIG_BOOT_CONFIG_FORCE=y will look for the embedded bootconfig file even when the "bootconfig" kernel parameter is omitted. This permits kernel-version-dependent kernel boot parameters to be embedded into the kernel image without the need to (for example) update large numbers of boot loaders. Link: https://lore.kernel.org/all/20230105005838.GA1772817@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <linux-doc@vger.kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-02-02mm/page_ext: init page_ext early if there are no deferred struct pagesPasha Tatashin1-3/+3
page_ext must be initialized after all struct pages are initialized. Therefore, page_ext is initialized after page_alloc_init_late(), and can optionally be initialized earlier via early_page_ext kernel parameter which as a side effect also disables deferred struct pages. Allow to automatically init page_ext early when there are no deferred struct pages in order to be able to use page_ext during kernel boot and track for example page allocations early. [pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n] Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15x86/mm: Initialize text poking earlierPeter Zijlstra1-2/+1
Move poking_init() up a bunch; specifically move it right after mm_init() which is right before ftrace_init(). This will allow simplifying ftrace text poking which currently has a bunch of exceptions for early boot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.881703081@infradead.org
2022-12-15mm: Move mm_cachep initialization to mm_init()Peter Zijlstra1-0/+1
In order to allow using mm_alloc() much earlier, move initializing mm_cachep into mm_init(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org
2022-11-18proc: give /proc/cmdline sizeAlexey Dobriyan1-2/+5
Most /proc files don't have length (in fstat sense). This leads to inefficiencies when reading such files with APIs commonly found in modern programming languages. They open file, then fstat descriptor, get st_size == 0 and either assume file is empty or start reading without knowing target size. cat(1) does OK because it uses large enough buffer by default. But naive programs copy-pasted from SO aren't: let mut f = std::fs::File::open("/proc/cmdline").unwrap(); let mut buf: Vec<u8> = Vec::new(); f.read_to_end(&mut buf).unwrap(); will result in openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3 statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address) statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0 lseek(3, 0, SEEK_CUR) = 0 read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32 read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32 read(3, "apper/fedora_localhost--live-roo"..., 64) = 64 read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116 read(3, "", 12) open/stat is OK, lseek looks silly but there are 3 unnecessary reads because Rust starts with 32 bytes per Vec<u8> and grows from there. In case of /proc/cmdline, the length is known precisely. Make variables readonly while I'm at it. P.S.: I tried to scp /proc/cpuinfo today and got empty file but this is separate story. Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12Merge tag 'mm-nonmm-stable-2022-10-11' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds1-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-03init/main.c: remove unnecessary (void*) conversionsZhou jie1-2/+2
The void pointer object can be directly assigned to different structure objects, it does not need to be cast. Link: https://lkml.kernel.org/r/20220928014539.11046-1-zhoujie@nfschina.com Signed-off-by: Zhou jie <zhoujie@nfschina.com> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03init: kmsan: call KMSAN initialization routinesAlexander Potapenko1-0/+3
kmsan_init_shadow() scans the mappings created at boot time and creates metadata pages for those mappings. When the memblock allocator returns pages to pagealloc, we reserve 2/3 of those pages and use them as metadata for the remaining 1/3. Once KMSAN starts, every page allocated by pagealloc has its associated shadow and origin pages. kmsan_initialize() initializes the bookkeeping for init_task and enables KMSAN. Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-29random: split initialization into early step and later stepJason A. Donenfeld1-9/+8
The full RNG initialization relies on some timestamps, made possible with initialization functions like time_init() and timekeeping_init(). However, these are only available rather late in initialization. Meanwhile, other things, such as memory allocator functions, make use of the RNG much earlier. So split RNG initialization into two phases. We can provide arch randomness very early on, and then later, after timekeeping and such are available, initialize the rest. This ensures that, for example, slabs are properly randomized if RDRAND is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree of its security, because its random seed is potentially deterministic, since it hasn't yet incorporated RDRAND. It also makes it possible to use a better seed in kfence, which currently relies on only the cycle counter. Another positive consequence is that on systems with RDRAND, running with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all. One subtle side effect of this change is that on systems with no RDRAND, RDTSC is now only queried by random_init() once, committing the moment of the function call, instead of multiple times as before. This is intentional, as the multiple RDTSCs in a loop before weren't accomplishing very much, with jitter being better provided by try_to_generate_entropy(). Plus, filling blocks with RDTSC is still being done in extract_entropy(), which is necessarily called before random bytes are served anyway. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-26Maple Tree: add new data structureLiam R. Howlett1-0/+2
Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11init: move from strlcpy with unused retval to strscpyWolfram Sang1-2/+2
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220818210200.8203-1-wsa+renesas@sang-engineering.com Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11page_ext: introduce boot parameter 'early_page_ext'Li Zhe1-1/+5
In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in page_ext_init""), we call page_ext_init() after page_alloc_init_late() to avoid some panic problem. It seems that we cannot track early page allocations in current kernel even if page structure has been initialized early. This patch introduces a new boot parameter 'early_page_ext' to resolve this problem. If we pass it to the kernel, page_ext_init() will be moved up and the feature 'deferred initialization of struct pages' will be disabled to initialize the page allocator early and prevent the panic problem above. It can help us to catch early page allocations. This is useful especially when we find that the free memory value is not the same right after different kernel booting. [akpm@linux-foundation.org: fix section issue by removing __meminitdata] Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-23arm64: fix rodata=fullMark Rutland1-3/+15
On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-01stack: Declare {randomize_,}kstack_offset to fix Sparse warningsGONG, Ruiqi1-0/+1
Fix the following Sparse warnings that got noticed when the PPC-dev patchwork was checking another patch (see the link below): init/main.c:862:1: warning: symbol 'randomize_kstack_offset' was not declared. Should it be static? init/main.c:864:1: warning: symbol 'kstack_offset' was not declared. Should it be static? Which in fact are triggered on all architectures that have HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET support (for instances x86, arm64 etc). Link: https://lore.kernel.org/lkml/e7b0d68b-914d-7283-827c-101988923929@huawei.com/T/#m49b2d4490121445ce4bf7653500aba59eefcb67f Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall") Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220629060423.2515693-1-gongruiqi1@huawei.com
2022-06-03Merge tag 'kthread-cleanups-for-v5.19' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull kthread updates from Eric Biederman: "This updates init and user mode helper tasks to be ordinary user mode tasks. Commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") caused init and the user mode helper threads that call kernel_execve to have struct kthread allocated for them. This struct kthread going away during execve in turned made a use after free of struct kthread possible. Here, commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") is enough to fix the use after free and is simple enough to be backportable. The rest of the changes pass struct kernel_clone_args to clean things up and cause the code to make sense. In making init and the user mode helpers tasks purely user mode tasks I ran into two complications. The function task_tick_numa was detecting tasks without an mm by testing for the presence of PF_KTHREAD. The initramfs code in populate_initrd_image was using flush_delayed_fput to ensuere the closing of all it's file descriptors was complete, and flush_delayed_fput does not work in a userspace thread. I have looked and looked and more complications and in my code review I have not found any, and neither has anyone else with the code sitting in linux-next" * tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: sched: Update task_tick_numa to ignore tasks without an mm fork: Stop allowing kthreads to call execve fork: Explicitly set PF_KTHREAD init: Deal with the init process being a user mode process fork: Generalize PF_IO_WORKER handling fork: Explicity test for idle tasks in copy_thread fork: Pass struct kernel_clone_args into copy_thread kthread: Don't allocate kthread_struct for init and umh
2022-05-29Merge tag 'trace-v5.19' of ↵Linus Torvalds1-19/+19
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "The majority of the changes are for fixes and clean ups. Notable changes: - Rework trace event triggers code to be easier to interact with. - Support for embedding bootconfig with the kernel (as suppose to having it embedded in initram). This is useful for embedded boards without initram disks. - Speed up boot by parallelizing the creation of tracefs files. - Allow absolute ring buffer timestamps handle timestamps that use more than 59 bits. - Added new tracing clock "TAI" (International Atomic Time) - Have weak functions show up in available_filter_function list as: __ftrace_invalid_address___<invalid-offset> instead of using the name of the function before it" * tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits) ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function tracing: Fix comments for event_trigger_separate_filter() x86/traceponit: Fix comment about irq vector tracepoints x86,tracing: Remove unused headers ftrace: Clean up hash direct_functions on register failures tracing: Fix comments of create_filter() tracing: Disable kcov on trace_preemptirq.c tracing: Initialize integer variable to prevent garbage return value ftrace: Fix typo in comment ftrace: Remove return value of ftrace_arch_modify_*() tracing: Cleanup code by removing init "char *name" tracing: Change "char *" string form to "char []" tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ tracing/timerlat: Print stacktrace in the IRQ handler if needed tracing/timerlat: Notify IRQ new max latency only if stop tracing is set kprobes: Fix build errors with CONFIG_KRETPROBES=n tracing: Fix return value of trace_pid_write() tracing: Fix potential double free in create_var_ref() tracing: Use strim() to remove whitespace instead of doing it manually ftrace: Deal with error return code of the ftrace_process_locs() function ...
2022-05-18random: handle latent entropy and command line from random_init()Jason A. Donenfeld1-7/+3
Currently, start_kernel() adds latent entropy and the command line to the entropy bool *after* the RNG has been initialized, deferring when it's actually used by things like stack canaries until the next time the pool is seeded. This surely is not intended. Rather than splitting up which entropy gets added where and when between start_kernel() and random_init(), just do everything in random_init(), which should eliminate these kinds of bugs in the future. While we're at it, rename the awkwardly titled "rand_initialize()" to the more standard "random_init()" nomenclature. Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-13init: call time_init() before rand_initialize()Jason A. Donenfeld1-1/+2
Currently time_init() is called after rand_initialize(), but rand_initialize() makes use of the timer on various platforms, and sometimes this timer needs to be initialized by time_init() first. In order for random_get_entropy() to not return zero during early boot when it's potentially used as an entropy source, reverse the order of these two calls. The block doing random initialization was right before time_init() before, so changing the order shouldn't have any complicated effects. Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Stafford Horne <shorne@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-06kthread: Don't allocate kthread_struct for init and umhEric W. Biederman1-1/+1
If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from. This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"). When kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task. Looking a bit deeper this only applies to callers of kernel_execve which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly that copy_thread does not know how to take a kernel mode function to the process with for processes without PF_KTHREAD or PF_IO_WORKER set. Solve this by not allocating kthread_struct for the init process and the user mode helper processes. This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set. I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL. There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. These functions are kthread_data(), kthread_parmme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably expected to be called when it is know that a task is a kthread so that assumption seems reasonable. Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <maximkabox13@gmail.com> Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-04-26bootconfig: Support embedding a bootconfig file in kernelMasami Hiramatsu1-10/+12
This allows kernel developer to embed a default bootconfig file in the kernel instead of embedding it in the initrd. This will be good for who are using the kernel without initrd, or who needs a default bootconfigs. This needs to set two kconfigs: CONFIG_BOOT_CONFIG_EMBED=y and set the file path to CONFIG_BOOT_CONFIG_EMBED_FILE. Note that you still need 'bootconfig' command line option to load the embedded bootconfig. Also if you boot using an initrd with a different bootconfig, the kernel will use the bootconfig in the initrd, instead of the default bootconfig. Link: https://lkml.kernel.org/r/164921227943.1090670.14035119557571329218.stgit@devnote2 Cc: Padmanabha Srinivasaiah <treasure4paddy@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Linux Kbuild mailing list <linux-kbuild@vger.kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26bootconfig: Check the checksum before removing the bootconfig from initrdMasami Hiramatsu1-12/+10
Check the bootconfig's checksum before removing the bootconfig data from initrd to avoid modifying initrd by mistake. This will also simplifies the get_boot_config_from_initrd() interface. Link: https://lkml.kernel.org/r/164921226891.1090670.16955839243639298134.stgit@devnote2 Cc: Padmanabha Srinivasaiah <treasure4paddy@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Linux Kbuild mailing list <linux-kbuild@vger.kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-8/+6
Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ...
2022-03-24Merge tag 'net-next-5.18' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "The sprinkling of SPI drivers is because we added a new one and Mark sent us a SPI driver interface conversion pull request. Core ---- - Introduce XDP multi-buffer support, allowing the use of XDP with jumbo frame MTUs and combination with Rx coalescing offloads (LRO). - Speed up netns dismantling (5x) and lower the memory cost a little. Remove unnecessary per-netns sockets. Scope some lists to a netns. Cut down RCU syncing. Use batch methods. Allow netdev registration to complete out of order. - Support distinguishing timestamp types (ingress vs egress) and maintaining them across packet scrubbing points (e.g. redirect). - Continue the work of annotating packet drop reasons throughout the stack. - Switch netdev error counters from an atomic to dynamically allocated per-CPU counters. - Rework a few preempt_disable(), local_irq_save() and busy waiting sections problematic on PREEMPT_RT. - Extend the ref_tracker to allow catching use-after-free bugs. BPF --- - Introduce "packing allocator" for BPF JIT images. JITed code is marked read only, and used to be allocated at page granularity. Custom allocator allows for more efficient memory use, lower iTLB pressure and prevents identity mapping huge pages from getting split. - Make use of BTF type annotations (e.g. __user, __percpu) to enforce the correct probe read access method, add appropriate helpers. - Convert the BPF preload to use light skeleton and drop the user-mode-driver dependency. - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling its use as a packet generator. - Allow local storage memory to be allocated with GFP_KERNEL if called from a hook allowed to sleep. - Introduce fprobe (multi kprobe) to speed up mass attachment (arch bits to come later). - Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra. - Allow cgroup BPF progs to return custom errors to user space. - Add support for AF_UNIX iterator batching. - Allow iterator programs to use sleepable helpers. - Support JIT of add, and, or, xor and xchg atomic ops on arm64. - Add BTFGen support to bpftool which allows to use CO-RE in kernels without BTF info. - Large number of libbpf API improvements, cleanups and deprecations. Protocols --------- - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev. - Adjust TSO packet sizes based on min_rtt, allowing very low latency links (data centers) to always send full-sized TSO super-frames. - Make IPv6 flow label changes (AKA hash rethink) more configurable, via sysctl and setsockopt. Distinguish between server and client behavior. - VxLAN support to "collect metadata" devices to terminate only configured VNIs. This is similar to VLAN filtering in the bridge. - Support inserting IPv6 IOAM information to a fraction of frames. - Add protocol attribute to IP addresses to allow identifying where given address comes from (kernel-generated, DHCP etc.) - Support setting socket and IPv6 options via cmsg on ping6 sockets. - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS. Define dscp_t and stop taking ECN bits into account in fib-rules. - Add support for locked bridge ports (for 802.1X). - tun: support NAPI for packets received from batched XDP buffs, doubling the performance in some scenarios. - IPv6 extension header handling in Open vSwitch. - Support IPv6 control message load balancing in bonding, prevent neighbor solicitation and advertisement from using the wrong port. Support NS/NA monitor selection similar to existing ARP monitor. - SMC - improve performance with TCP_CORK and sendfile() - support auto-corking - support TCP_NODELAY - MCTP (Management Component Transport Protocol) - add user space tag control interface - I2C binding driver (as specified by DMTF DSP0237) - Multi-BSSID beacon handling in AP mode for WiFi. - Bluetooth: - handle MSFT Monitor Device Event - add MGMT Adv Monitor Device Found/Lost events - Multi-Path TCP: - add support for the SO_SNDTIMEO socket option - lots of selftest cleanups and improvements - Increase the max PDU size in CAN ISOTP to 64 kB. Driver API ---------- - Add HW counters for SW netdevs, a mechanism for devices which offload packet forwarding to report packet statistics back to software interfaces such as tunnels. - Select the default NIC queue count as a fraction of number of physical CPU cores, instead of hard-coding to 8. - Expose devlink instance locks to drivers. Allow device layer of drivers to use that lock directly instead of creating their own which always runs into ordering issues in devlink callbacks. - Add header/data split indication to guide user space enabling of TCP zero-copy Rx. - Allow configuring completion queue event size. - Refactor page_pool to enable fragmenting after allocation. - Add allocation and page reuse statistics to page_pool. - Improve Multiple Spanning Trees support in the bridge to allow reuse of topologies across VLANs, saving HW resources in switches. - DSA (Distributed Switch Architecture): - replay and offload of host VLAN entries - offload of static and local FDB entries on LAG interfaces - FDB isolation and unicast filtering New hardware / drivers ---------------------- - Ethernet: - LAN937x T1 PHYs - Davicom DM9051 SPI NIC driver - Realtek RTL8367S, RTL8367RB-VB switch and MDIO - Microchip ksz8563 switches - Netronome NFP3800 SmartNICs - Fungible SmartNICs - MediaTek MT8195 switches - WiFi: - mt76: MediaTek mt7916 - mt76: MediaTek mt7921u USB adapters - brcmfmac: Broadcom BCM43454/6 - Mobile: - iosm: Intel M.2 7360 WWAN card Drivers ------- - Convert many drivers to the new phylink API built for split PCS designs but also simplifying other cases. - Intel Ethernet NICs: - add TTY for GNSS module for E810T device - improve AF_XDP performance - GTP-C and GTP-U filter offload - QinQ VLAN support - Mellanox Ethernet NICs (mlx5): - support xdp->data_meta - multi-buffer XDP - offload tc push_eth and pop_eth actions - Netronome Ethernet NICs (nfp): - flow-independent tc action hardware offload (police / meter) - AF_XDP - Other Ethernet NICs: - at803x: fiber and SFP support - xgmac: mdio: preamble suppression and custom MDC frequencies - r8169: enable ASPM L1.2 if system vendor flags it as safe - macb/gem: ZynqMP SGMII - hns3: add TX push mode - dpaa2-eth: software TSO - lan743x: multi-queue, mdio, SGMII, PTP - axienet: NAPI and GRO support - Mellanox Ethernet switches (mlxsw): - source and dest IP address rewrites - RJ45 ports - Marvell Ethernet switches (prestera): - basic routing offload - multi-chain TC ACL offload - NXP embedded Ethernet switches (ocelot & felix): - PTP over UDP with the ocelot-8021q DSA tagging protocol - basic QoS classification on Felix DSA switch using dcbnl - port mirroring for ocelot switches - Microchip high-speed industrial Ethernet (sparx5): - offloading of bridge port flooding flags - PTP Hardware Clock - Other embedded switches: - lan966x: PTP Hardward Clock - qca8k: mdio read/write operations via crafted Ethernet packets - Qualcomm 802.11ax WiFi (ath11k): - add LDPC FEC type and 802.11ax High Efficiency data in radiotap - enable RX PPDU stats in monitor co-exist mode - Intel WiFi (iwlwifi): - UHB TAS enablement via BIOS - band disablement via BIOS - channel switch offload - 32 Rx AMPDU sessions in newer devices - MediaTek WiFi (mt76): - background radar detection - thermal management improvements on mt7915 - SAR support for more mt76 platforms - MBSSID and 6 GHz band on mt7915 - RealTek WiFi: - rtw89: AP mode - rtw89: 160 MHz channels and 6 GHz band - rtw89: hardware scan - Bluetooth: - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS) - Microchip CAN (mcp251xfd): - multiple RX-FIFOs and runtime configurable RX/TX rings - internal PLL, runtime PM handling simplification - improve chip detection and error handling after wakeup" * tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits) llc: fix netdevice reference leaks in llc_ui_bind() drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool ice: don't allow to run ice_send_event_to_aux() in atomic ctx ice: fix 'scheduling while atomic' on aux critical err interrupt net/sched: fix incorrect vlan_push_eth dest field net: bridge: mst: Restrict info size queries to bridge ports net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init() drivers: net: xgene: Fix regression in CRC stripping net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT net: dsa: fix missing host-filtered multicast addresses net/mlx5e: Fix build warning, detected write beyond size of field iwlwifi: mvm: Don't fail if PPAG isn't supported selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" netdevice: add missing dm_private kdoc net: bridge: mst: prevent NULL deref in br_mst_info_size() selftests: forwarding: Use same VRF for port and VLAN upper ...
2022-03-23init/main.c: return 1 from handled __setup() functionsRandy Dunlap1-2/+4
initcall_blacklist() should return 1 to indicate that it handled its cmdline arguments. set_debug_rodata() should return 1 to indicate that it handled its cmdline arguments. Print a warning if the option string is invalid. This prevents these strings from being added to the 'init' program's environment as they are not init arguments/parameters. Link: https://lkml.kernel.org/r/20220221050901.23985-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Cc: Ingo Molnar <mingo@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23init: use ktime_us_delta() to make initcall_debug log more preciseMark-PK Tsai1-6/+2
Use ktime_us_delta() to make the initcall_debug log more precise than right shifting the result of ktime_to_ns() by 10 bits. Link: https://lkml.kernel.org/r/20220209053350.15771-1-mark-pk.tsai@mediatek.com Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Andrew Halaney <ahalaney@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Kees Cook <keescook@chromium.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: YJ Chiang <yj.chiang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-14stack: Introduce CONFIG_RANDOMIZE_KSTACK_OFFSETMarco Elver1-1/+1
The randomize_kstack_offset feature is unconditionally compiled in when the architecture supports it. To add constraints on compiler versions, we require a dedicated Kconfig variable. Therefore, introduce RANDOMIZE_KSTACK_OFFSET. Furthermore, this option is now also configurable by EXPERT kernels: while the feature is supposed to have zero performance overhead when disabled, due to its use of static branches, there are few cases where giving a distribution the option to disable the feature entirely makes sense. For example, in very resource constrained environments, which would never enable the feature to begin with, in which case the additional kernel code size increase would be redundant. Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220131090521.1947110-1-elver@google.com
2022-02-06net: initialize init_net earlierEric Dumazet1-0/+2
While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized. This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS); This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net) We had too many prior patches [1] trying to work around the root cause. Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time. Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init() As a bonus, this patch saves around 4KB in .data section. [1] f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace") b5082df8019a ("net: Initialise init_net.count to 1") 734b65417b24 ("net: Statically initialize init_net.dev_base_head") v2: fixed a build error reported by kernel build bots (CONFIG_NET=n) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-22lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()Vlastimil Babka1-3/+6
Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Marco Elver <elver@google.com> # stackdepot Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> From: Colin Ian King <colin.king@canonical.com> Subject: lib/stackdepot: fix spelling mistake and grammar in pr_err message There is a spelling mistake of the work allocation so fix this and re-phrase the message to make it easier to read. Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Vlastimil Babka <vbabka@suse.cz> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup On FLATMEM, we call page_ext_init_flatmem_late() just before kmem_cache_init() which means stack_depot_init() (called by page owner init) will not recognize properly it should use kvmalloc() and not memblock_alloc(). memblock_alloc() will also not issue a warning and return a block memory that can be invalid and cause kernel page fault when saving stacks, as reported by the kernel test robot [1]. Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so that slab_is_available() is true during stack_depot_init(). SPARSEMEM doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(), but a different page_ext_init() even later in the boot process. Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue. While at it, also actually resolve a checkpatch warning in stack_depot_init() from DRM CI, which was supposed to be in the original patch already. [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: kernel test robot <oliver.sang@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3 Due to cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") landing recently to -next adding a new stack depot user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup4 Due to 4e66934eaadc ("lib: add reference counting tracking infrastructure") landing recently to net-next adding a new stack depot user in lib/ref_tracker.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/45c1b738-1a2f-5b5f-2f6d-86fab206d01c@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Slab <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+3
Merge more updates from Andrew Morton: "87 patches. Subsystems affected by this patch series: mm (pagecache and hugetlb), procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs, init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork, sysvfs, kcov, gdb, resource, selftests, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits) ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL ipc: check checkpoint_restore_ns_capable() to modify C/R proc files selftests/kselftest/runner/run_one(): allow running non-executable files virtio-mem: disallow mapping virtio-mem memory via /dev/mem kernel/resource: disallow access to exclusive system RAM regions kernel/resource: clean up and optimize iomem_is_exclusive() scripts/gdb: handle split debug for vmlinux kcov: replace local_irq_save() with a local_lock_t kcov: avoid enable+disable interrupts if !in_task() kcov: allocate per-CPU memory on the relevant node Documentation/kcov: define `ip' in the example Documentation/kcov: include types.h in the example sysv: use BUILD_BUG_ON instead of runtime check kernel/fork.c: unshare(): use swap() to make code cleaner seq_file: fix passing wrong private data seq_file: move seq_escape() to a header signal: remove duplicate include in signal.h crash_dump: remove duplicate include in crash_dump.h crash_dump: fix boolreturn.cocci warning hfs/hfsplus: use WARN_ON for sanity check ...
2021-11-09init: make unknown command line param message clearerAndrew Halaney1-1/+3
The prior message is confusing users, which is the exact opposite of the goal. If the message is being seen, one of the following situations is happening: 1. the param is misspelled 2. the param is not valid due to the kernel configuration 3. the param is intended for init but isn't after the '--' delineator on the command line To make that more clear to the user, explicitly mention "kernel command line" and also note that the params are still passed to user space to avoid causing any alarm over params intended for init. Link: https://lkml.kernel.org/r/20211013223502.96756-1-ahalaney@redhat.com Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+4
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06memblock: use memblock_free for freeing virtual pointersMike Rapoport1-2/+2
Rename memblock_free_ptr() to memblock_free() and use memblock_free() when freeing a virtual pointer so that memblock_free() will be a counterpart of memblock_alloc() The callers are updated with the below semantic patch and manual addition of (void *) casting to pointers that are represented by unsigned long variables. @@ identifier vaddr; expression size; @@ ( - memblock_phys_free(__pa(vaddr), size); + memblock_free(vaddr, size); | - memblock_free_ptr(vaddr, size); + memblock_free(vaddr, size); ) [sfr@canb.auug.org.au: fixup] Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Juergen Gross <jgross@suse.com> Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: create a new system state and fix core_kernel_text()Christophe Leroy1-0/+2
core_kernel_text() considers that until system_state in at least SYSTEM_RUNNING, init memory is valid. But init memory is freed a few lines before setting SYSTEM_RUNNING, so we have a small period of time when core_kernel_text() is wrong. Create an intermediate system state called SYSTEM_FREEING_INIT that is set before starting freeing init memory, and use it in core_kernel_text() to report init memory invalid earlier. Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-01Merge tag 'trace-v5.16' of ↵Linus Torvalds1-12/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - kprobes: Restructured stack unwinder to show properly on x86 when a stack dump happens from a kretprobe callback. - Fix to bootconfig parsing - Have tracefs allow owner and group permissions by default (only denying others). There's been pressure to allow non root to tracefs in a controlled fashion, and using groups is probably the safest. - Bootconfig memory managament updates. - Bootconfig clean up to have the tools directory be less dependent on changes in the kernel tree. - Allow perf to be traced by function tracer. - Rewrite of function graph tracer to be a callback from the function tracer instead of having its own trampoline (this change will happen on an arch by arch basis, and currently only x86_64 implements it). - Allow multiple direct trampolines (bpf hooks to functions) be batched together in one synchronization. - Allow histogram triggers to add variables that can perform calculations against the event's fields. - Use the linker to determine architecture callbacks from the ftrace trampoline to allow for proper parameter prototypes and prevent warnings from the compiler. - Extend histogram triggers to key off of variables. - Have trace recursion use bit magic to determine preempt context over if branches. - Have trace recursion disable preemption as all use cases do anyway. - Added testing for verification of tracing utilities. - Various small clean ups and fixes. * tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits) tracing/histogram: Fix semicolon.cocci warnings tracing/histogram: Fix documentation inline emphasis warning tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together tracing: Show size of requested perf buffer bootconfig: Initialize ret in xbc_parse_tree() ftrace: do CPU checking after preemption disabled ftrace: disable preemption when recursion locked tracing/histogram: Document expression arithmetic and constants tracing/histogram: Optimize division by a power of 2 tracing/histogram: Covert expr to const if both operands are constants tracing/histogram: Simplify handling of .sym-offset in expressions tracing: Fix operator precedence for hist triggers expression tracing: Add division and multiplication support for hist triggers tracing: Add support for creating hist trigger variables from literal selftests/ftrace: Stop tracing while reading the trace file by default MAINTAINERS: Update KPROBES and TRACING entries test_kprobes: Move it from kernel/ to lib/ docs, kprobes: Remove invalid URL and add new reference samples/kretprobes: Fix return value if register_kretprobe() failed lib/bootconfig: Fix the xbc_get_info kerneldoc ...
2021-10-18block: move elevator.h to block/Christoph Hellwig1-1/+0
Except for the features passed to blk_queue_required_elevator_features, elevator.h is only needed internally to the block layer. Move the ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to block/, dropping all the spurious includes outside of that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-10bootconfig: init: Fix memblock leak in xbc_make_cmdline()Masami Hiramatsu1-0/+1
Free unused memblock in a error case to fix memblock leak in xbc_make_cmdline(). Link: https://lkml.kernel.org/r/163177339181.682366.8713781325929549256.stgit@devnote2 Fixes: 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig for kernel command line") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-10bootconfig: Rename xbc_destroy_all() to xbc_exit()Masami Hiramatsu1-1/+1
Avoid using this noisy name and use more calm one. This is just a name change. No functional change. Link: https://lkml.kernel.org/r/163187295918.2366983.5231840238429996027.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-10bootconfig: Add xbc_get_info() for the node informationMasami Hiramatsu1-0/+1
Add xbc_get_info() API which allows user to get the number of used xbc_nodes and the size of bootconfig data. This is also useful for checking the bootconfig is initialized or not. Link: https://lkml.kernel.org/r/163177340877.682366.4360676589783197627.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-10bootconfig: Allocate xbc_data inside xbc_init()Masami Hiramatsu1-11/+2
Allocate 'xbc_data' in the xbc_init() so that it does not need to care about the ownership of the copied data. Link: https://lkml.kernel.org/r/163177339986.682366.898762699429769117.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-22init: Revert accidental changes to print irqs_disabled()Geert Uytterhoeven1-3/+3
Commit f8ade8dddb16 ("xsurf100: drop include of lib8390.c") accidentally changed init/main.c. Revert that part. Fixes: f8ade8dddb16 ("xsurf100: drop include of lib8390.c") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-14memblock: introduce saner 'memblock_free_ptr()' interfaceLinus Torvalds1-1/+1
The boot-time allocation interface for memblock is a mess, with 'memblock_alloc()' returning a virtual pointer, but then you are supposed to free it with 'memblock_free()' that takes a _physical_ address. Not only is that all kinds of strange and illogical, but it actually causes bugs, when people then use it like a normal allocation function, and it fails spectacularly on a NULL pointer: https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/ or just random memory corruption if the debug checks don't catch it: https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/ I really don't want to apply patches that treat the symptoms, when the fundamental cause is this horribly confusing interface. I started out looking at just automating a sane replacement sequence, but because of this mix or virtual and physical addresses, and because people have used the "__pa()" macro that can take either a regular kernel pointer, or just the raw "unsigned long" address, it's all quite messy. So this just introduces a new saner interface for freeing a virtual address that was allocated using 'memblock_alloc()', and that was kept as a regular kernel pointer. And then it converts a couple of users that are obvious and easy to test, including the 'xbc_nodes' case in lib/bootconfig.c that caused problems. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed") Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-09Merge tag 'trace-v5.15-2' of ↵Linus Torvalds1-11/+26
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull more tracing updates from Steven Rostedt: - Add migrate-disable counter to tracing header - Fix error handling in event probes - Fix missed unlock in osnoise in error path - Fix merge issue with tools/bootconfig - Clean up bootconfig data when init memory is removed - Fix bootconfig to loop only on subkeys - Have kernel command lines override bootconfig options - Increase field counts for synthetic events - Have histograms dynamic allocate event elements to save space - Fixes in testing and documentation * tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/boot: Fix to loop on only subkeys selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events tracing: Dynamically allocate the per-elt hist_elt_data array tracing: synth events: increase max fields count tools/bootconfig: Show whole test command for each test case bootconfig: Fix missing return check of xbc_node_compose_key function tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh docs: bootconfig: Add how to use bootconfig for kernel parameters init/bootconfig: Reorder init parameter from bootconfig and cmdline init: bootconfig: Remove all bootconfig data when the init memory is removed tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads() tracing: Fix some alloc_event_probe() error handling bugs tracing: Add migrate-disabled counter to tracing output.
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08init/bootconfig: Reorder init parameter from bootconfig and cmdlineMasami Hiramatsu1-9/+14
Reorder the init parameters from bootconfig and kernel cmdline so that the kernel cmdline always be the last part of the parameters as below. " -- "[bootconfig init params][cmdline init params] This change will help us to prevent that bootconfig init params overwrite the init params which user gives in the command line. Link: https://lkml.kernel.org/r/163077085675.222577.5665176468023636160.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-08init: bootconfig: Remove all bootconfig data when the init memory is removedMasami Hiramatsu1-2/+12
Since the bootconfig is used only in the init functions, it doesn't need to keep the data after boot. Free it when the init memory is removed. Link: https://lkml.kernel.org/r/163077084958.222577.5924961258513004428.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-09-08trap: cleanup trap_init()Kefeng Wang1-0/+2
There are some empty trap_init() definitions in different ARCHs, Introduce a new weak trap_init() function to clean them up. Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm32] Acked-by: Vineet Gupta [arc] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <palmerdabbelt@google.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08init: move usermodehelper_enable() to populate_rootfs()Rasmus Villemoes1-1/+0
Currently, usermodehelper is enabled right before PID1 starts going through the initcalls. However, any call of a usermodehelper from a pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as there is no filesystem contents yet. Up until commit e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously"), such calls, whether via some request_module(), a legacy uevent "/sbin/hotplug" notification or something else, would just fail silently with (presumably) -ENOENT from kernel_execve(). However, that commit introduced the wait_for_initramfs() synchronization hook which must be called from the usermodehelper exec path right before the kernel_execve, in order that request_module() et al done from *after* rootfs_initcall() time (i.e. device_ and late_ initcalls) would continue to find a populated initramfs as they used to. Any call of wait_for_initramfs() done before the unpacking has been scheduled (i.e. before rootfs_initcall time) must just return immediately [and let the caller find an empty file system] in order not to deadlock the machine. I mistakenly thought, and my limited testing confirmed, that there were no such calls, so I added a pr_warn_once() in wait_for_initramfs(). It turns out that one can indeed hit request_module() as well as kobject_uevent_env() during those early init calls, leading to a user-visible warning in the kernel log emitted consistently for certain configurations. We could just remove the pr_warn_once(), but I think it's better to postpone enabling the usermodehelper framework until there is at least some chance of finding the executable. That is also a little more efficient in that a lot of work done in umh.c will be elided. However, it does change the error seen by those early callers from -ENOENT to -EBUSY, so there is a risk of a regression if any caller care about the exact error value. Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk Fixes: e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Reported-by: Heiner Kallweit <hkallweit1@gmail.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+7
drivers/ptp/Kconfig: 55c8fca1dae1 ("ptp_pch: Restore dependency on PCI") e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-12init: Suppress wrong warning for bootconfig cmdline parameterMasami Hiramatsu1-2/+7
Since the 'bootconfig' command line parameter is handled before parsing the command line, it doesn't use early_param(). But in this case, kernel shows a wrong warning message about it. [ 0.013714] Kernel command line: ro console=ttyS0 bootconfig console=tty0 [ 0.013741] Unknown command line parameters: bootconfig To suppress this message, add a dummy handler for 'bootconfig'. Link: https://lkml.kernel.org/r/162812945097.77369.1849780946468010448.stgit@devnote2 Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-03xsurf100: drop include of lib8390.cMichael Schmitz1-3/+3
Now that ax88796.c exports the ax_NS8390_reinit() symbol, we can include 8390.h instead of lib8390.c, avoiding duplication of that function and killing a few compile warnings in the bargain. Fixes: 861928f4e60e826c ("net-next: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)") Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08buildid: stash away kernels build ID on initStephen Boyd1-0/+2
Parse the kernel's build ID at initialization so that other code can print a hex format string representation of the running kernel's build ID. This will be used in the kdump and dump_stack code so that developers can easily locate the vmlinux debug symbols for a crash/stacktrace. [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()] Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org Signed-off-by: Stephen Boyd <swboyd@chromium.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Evan Green <evgreen@chromium.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sasha Levin <sashal@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-04Merge branch 'core-rcu-2021.07.04' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Bitmap parsing support for "all" as an alias for all bits - Documentation updates - Miscellaneous fixes, including some that overlap into mm and lockdep - kvfree_rcu() updates - mem_dump_obj() updates, with acks from one of the slab-allocator maintainers - RCU NOCB CPU updates, including limited deoffloading - SRCU updates - Tasks-RCU updates - Torture-test updates * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits) tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states rcu: Add missing __releases() annotation rcu: Remove obsolete rcu_read_unlock() deadlock commentary rcu: Improve comments describing RCU read-side critical sections rcu: Create an unrcu_pointer() to remove __rcu from a pointer srcu: Early test SRCU polling start rcu: Fix various typos in comments rcu/nocb: Unify timers rcu/nocb: Prepare for fine-grained deferred wakeup rcu/nocb: Only cancel nocb timer if not polling rcu/nocb: Delete bypass_timer upon nocb_gp wakeup rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup rcu/nocb: Allow de-offloading rdp leader rcu/nocb: Directly call __wake_nocb_gp() from bypass timer rcu: Don't penalize priority boosting when there is nothing to boost rcu: Point to documentation of ordering guarantees rcu: Make rcu_gp_cleanup() be noinline for tracing rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP ...
2021-07-03Merge tag 'trace-v5.14' of ↵Linus Torvalds1-11/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added option for per CPU threads to the hwlat tracer - Have hwlat tracer handle hotplug CPUs - New tracer: osnoise, that detects latency caused by interrupts, softirqs and scheduling of other tasks. - Added timerlat tracer that creates a thread and measures in detail what sources of latency it has for wake ups. - Removed the "success" field of the sched_wakeup trace event. This has been hardcoded as "1" since 2015, no tooling should be looking at it now. If one exists, we can revert this commit, fix that tool and try to remove it again in the future. - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids. - New boot command line option "tp_printk_stop", as tp_printk causes trace events to write to console. When user space starts, this can easily live lock the system. Having a boot option to stop just after boot up is useful to prevent that from happening. - Have ftrace_dump_on_oops boot command line option take numbers that match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops. - Bootconfig clean ups, fixes and enhancements. - New ktest script that tests bootconfig options. - Add tracepoint_probe_register_may_exist() to register a tracepoint without triggering a WARN*() if it already exists. BPF has a path from user space that can do this. All other paths are considered a bug. - Small clean ups and fixes * tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits) tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT tracing: Simplify & fix saved_tgids logic treewide: Add missing semicolons to __assign_str uses tracing: Change variable type as bool for clean-up trace/timerlat: Fix indentation on timerlat_main() trace/osnoise: Make 'noise' variable s64 in run_osnoise() tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing tracing: Fix spelling in osnoise tracer "interferences" -> "interference" Documentation: Fix a typo on trace/osnoise-tracer trace/osnoise: Fix return value on osnoise_init_hotplug_support trace/osnoise: Make interval u64 on osnoise_main trace/osnoise: Fix 'no previous prototype' warnings tracing: Have osnoise_main() add a quiescent state for task rcu seq_buf: Make trace_seq_putmem_hex() support data longer than 8 seq_buf: Fix overflow in seq_buf_putmem_hex() trace/osnoise: Support hotplug operations trace/hwlat: Support hotplug operations trace/hwlat: Protect kdata->kthread with get/put_online_cpus trace: Add timerlat tracer trace: Add osnoise tracer ...
2021-07-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+42
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...
2021-07-01init: print out unknown kernel parametersAndrew Halaney1-0/+42
It is easy to foobar setting a kernel parameter on the command line without realizing it, there's not much output that you can use to assess what the kernel did with that parameter by default. Make it a little more explicit which parameters on the command line _looked_ like a valid parameter for the kernel, but did not match anything and ultimately got tossed to init. This is very similar to the unknown parameter message received when loading a module. This assumes the parameters are processed in a normal fashion, some parameters (dyndbg= for example) don't register their parameter with the rest of the kernel's parameters, and therefore always show up in this list (and are also given to init - like the rest of this list). Another example is BOOT_IMAGE= is highlighted as an offender, which it technically is, but is passed by LILO and GRUB so most systems will see that complaint. An example output where "foobared" and "unrecognized" are intentionally invalid parameters: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-18Merge branch 'sched/urgent' into sched/core, to resolve conflictsIngo Molnar1-1/+1
This commit in sched/urgent moved the cfs_rq_is_decayed() function: a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: 9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-10bootconfig: Share the checksum function with toolsMasami Hiramatsu1-11/+1
Move the checksum calculation function into the header for sharing it with tools/bootconfig. Link: https://lkml.kernel.org/r/162262197470.264090.16325743685807878807.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-05pid: take a reference when initializing `cad_pid`Mark Rutland1-1/+1
During boot, kernel_init_freeable() initializes `cad_pid` to the init task's struct pid. Later on, we may change `cad_pid` via a sysctl, and when this happens proc_do_cad_pid() will increment the refcount on the new pid via get_pid(), and will decrement the refcount on the old pid via put_pid(). As we never called get_pid() when we initialized `cad_pid`, we decrement a reference we never incremented, can therefore free the init task's struct pid early. As there can be dangling references to the struct pid, we can later encounter a use-after-free (e.g. when delivering signals). This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to have been around since the conversion of `cad_pid` to struct pid in commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the pre-KASAN stone age of v2.6.19. Fix this by getting a reference to the init task's struct pid when we assign it to `cad_pid`. Full KASAN splat below. ================================================================== BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline] BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509 Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273 CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1 Hardware name: linux,dummy-virt (DT) Call trace: ns_of_pid include/linux/pid.h:153 [inline] task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509 do_notify_parent+0x308/0xe60 kernel/signal.c:1950 exit_notify kernel/exit.c:682 [inline] do_exit+0x2334/0x2bd0 kernel/exit.c:845 do_group_exit+0x108/0x2c8 kernel/exit.c:922 get_signal+0x4e4/0x2a88 kernel/signal.c:2781 do_signal arch/arm64/kernel/signal.c:882 [inline] do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936 work_pending+0xc/0x2dc Allocated by task 0: slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516 slab_alloc_node mm/slub.c:2907 [inline] slab_alloc mm/slub.c:2915 [inline] kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920 alloc_pid+0xdc/0xc00 kernel/pid.c:180 copy_process+0x2794/0x5e18 kernel/fork.c:2129 kernel_clone+0x194/0x13c8 kernel/fork.c:2500 kernel_thread+0xd4/0x110 kernel/fork.c:2552 rest_init+0x44/0x4a0 init/main.c:687 arch_call_rest_init+0x1c/0x28 start_kernel+0x520/0x554 init/main.c:1064 0x0 Freed by task 270: slab_free_hook mm/slub.c:1562 [inline] slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600 slab_free mm/slub.c:3161 [inline] kmem_cache_free+0x224/0x8e0 mm/slub.c:3177 put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114 put_pid+0x30/0x48 kernel/pid.c:109 proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401 proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591 proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617 call_write_iter include/linux/fs.h:1977 [inline] new_sync_write+0x3ac/0x510 fs/read_write.c:518 vfs_write fs/read_write.c:605 [inline] vfs_write+0x9c4/0x1018 fs/read_write.c:585 ksys_write+0x124/0x240 fs/read_write.c:658 __do_sys_write fs/read_write.c:670 [inline] __se_sys_write fs/read_write.c:667 [inline] __arm64_sys_write+0x78/0xb0 fs/read_write.c:667 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall arch/arm64/kernel/syscall.c:49 [inline] el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129 do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168 el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416 el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432 el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701 The buggy address belongs to the object at ffff23794dda0000 which belongs to the cache pid of size 224 The buggy address is located 4 bytes inside of 224-byte region [ffff23794dda0000, ffff23794dda00e0) The buggy address belongs to the page: page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0 head:(____ptrval____) order:1 compound_mapcount:0 flags: 0x3fffc0000010200(slab|head) raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080 raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00 ================================================================== Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-01sched,init: Fix DEBUG_PREEMPT vs early bootPeter Zijlstra1-5/+6
Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially") to cover the new PF_NO_SETAFFINITY requirement. While there, move wait_for_completion(&kthreadd_done) into kernel_init() to make it absolutely clear it is the very first thing done by the init thread. Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net
2021-05-12sched/core: Initialize the idle task with preemption disabledValentin Schneider1-5/+1
As pointed out by commit de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread") init_idle() can and will be invoked more than once on the same idle task. At boot time, it is invoked for the boot CPU thread by sched_init(). Then smp_init() creates the threads for all the secondary CPUs and invokes init_idle() on them. As the hotplug machinery brings the secondaries to life, it will issue calls to idle_thread_get(), which itself invokes init_idle() yet again. In this case it's invoked twice more per secondary: at _cpu_up(), and at bringup_cpu(). Given smp_init() already initializes the idle tasks for all *possible* CPUs, no further initialization should be required. Now, removing init_idle() from idle_thread_get() exposes some interesting expectations with regards to the idle task's preempt_count: the secondary startup always issues a preempt_disable(), requiring some reset of the preempt count to 0 between hot-unplug and hotplug, which is currently served by idle_thread_get() -> idle_init(). Given the idle task is supposed to have preemption disabled once and never see it re-enabled, it seems that what we actually want is to initialize its preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove init_idle() from idle_thread_get(). Secondary startups were patched via coccinelle: @begone@ @@ -preempt_disable(); ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-10srcu: Initialize SRCU after timersFrederic Weisbecker1-0/+2
Once srcu_init() is called, the SRCU core will make use of delayed workqueues, which rely on timers. However init_timers() is called several steps after rcu_init(). This means that a call_srcu() after rcu_init() but before init_timers() would find itself within a dangerously uninitialized timer core. This commit therefore creates a separate call to srcu_init() after init_timer() completes, which ensures that we stay in early SRCU mode until timers are safe(r). Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+1
Merge yet more updates from Andrew Morton: "This is everything else from -mm for this merge window. 90 patches. Subsystems affected by this patch series: mm (cleanups and slub), alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat, checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov, panic, delayacct, gdb, resource, selftests, async, initramfs, ipc, drivers/char, and spelling" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits) mm: fix typos in comments mm: fix typos in comments treewide: remove editor modelines and cruft ipc/sem.c: spelling fix fs: fat: fix spelling typo of values kernel/sys.c: fix typo kernel/up.c: fix typo kernel/user_namespace.c: fix typos kernel/umh.c: fix some spelling mistakes include/linux/pgtable.h: few spelling fixes mm/slab.c: fix spelling mistake "disired" -> "desired" scripts/spelling.txt: add "overflw" scripts/spelling.txt: Add "diabled" typo scripts/spelling.txt: add "overlfow" arm: print alloc free paths for address in registers mm/vmalloc: remove vwrite() mm: remove xlate_dev_kmem_ptr() drivers/char: remove /dev/kmem for good mm: fix some typos and code style problems ipc/sem.c: mundane typo fixes ...
2021-05-07init/initramfs.c: do unpacking asynchronouslyRasmus Villemoes1-0/+1
Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3. These two patches are independent, but better-together. The second is a rather trivial patch that simply allows the developer to change "/sbin/modprobe" to something else - e.g. the empty string, so that all request_module() during early boot return -ENOENT early, without even spawning a usermode helper, needlessly synchronizing with the initramfs unpacking. The first patch delegates decompressing the initramfs to a worker thread, allowing do_initcalls() in main.c to proceed to the device_ and late_ initcalls without waiting for that decompression (and populating of rootfs) to finish. Obviously, some of those later calls may rely on the initramfs being available, so I've added synchronization points in the firmware loader and usermodehelper paths - there might be other places that would need this, but so far no one has been able to think of any places I have missed. There's not much to win if most of the functionality needed during boot is only available as modules. But systems with a custom-made .config and initramfs can boot faster, partly due to utilizing more than one cpu earlier, partly by avoiding known-futile modprobe calls (which would still trigger synchronization with the initramfs unpacking, thus eliminating most of the first benefit). This patch (of 2): Most of the boot process doesn't actually need anything from the initramfs, until of course PID1 is to be executed. So instead of doing the decompressing and populating of the initramfs synchronously in populate_rootfs() itself, push that off to a worker thread. This is primarily motivated by an embedded ppc target, where unpacking even the rather modest sized initramfs takes 0.6 seconds, which is long enough that the external watchdog becomes unhappy that it doesn't get attention soon enough. By doing the initramfs decompression in a worker thread, we get to do the device_initcalls and hence start petting the watchdog much sooner. Normal desktops might benefit as well. On my mostly stock Ubuntu kernel, my initramfs is a 26M xz-compressed blob, decompressing to around 126M. That takes almost two seconds: [ 0.201454] Trying to unpack rootfs image as initramfs... [ 1.976633] Freeing initrd memory: 29416K Before this patch, these lines occur consecutively in dmesg. With this patch, the timestamps on these two lines is roughly the same as above, but with 172 lines inbetween - so more than one cpu has been kept busy doing work that would otherwise only happen after the populate_rootfs() finished. Should one of the initcalls done after rootfs_initcall time (i.e., device_ and late_ initcalls) need something from the initramfs (say, a kernel module or a firmware blob), it will simply wait for the initramfs unpacking to be done before proceeding, which should in theory make this completely safe. But if some driver pokes around in the filesystem directly and not via one of the official kernel interfaces (i.e. request_firmware*(), call_usermodehelper*) that theory may not hold - also, I certainly might have missed a spot when sprinkling wait_for_initramfs(). So there is an escape hatch in the form of an initramfs_async= command line parameter. Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-03Merge tag 'trace-v5.13' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "New feature: - A new "func-no-repeats" option in tracefs/options directory. When set the function tracer will detect if the current function being traced is the same as the previous one, and instead of recording it, it will keep track of the number of times that the function is repeated in a row. And when another function is recorded, it will write a new event that shows the function that repeated, the number of times it repeated and the time stamp of when the last repeated function occurred. Enhancements: - In order to implement the above "func-no-repeats" option, the ring buffer timestamp can now give the accurate timestamp of the event as it is being recorded, instead of having to record an absolute timestamp for all events. This helps the histogram code which no longer needs to waste ring buffer space. - New validation logic to make sure all trace events that access dereferenced pointers do so in a safe way, and will warn otherwise. Fixes: - No longer limit the PIDs of tasks that are recorded for "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger range. This caused the mapping of PIDs to the task names to be dropped for all tasks with a PID greater than 32768. - Change trace_clock_global() to never block. This caused a deadlock. Clean ups: - Typos, prototype fixes, and removing of duplicate or unused code. - Better management of ftrace_page allocations" * tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits) tracing: Restructure trace_clock_global() to never block tracing: Map all PIDs to command lines ftrace: Reuse the output of the function tracer for func_repeats tracing: Add "func_no_repeats" option for function tracing tracing: Unify the logic for function tracing options tracing: Add method for recording "func_repeats" events tracing: Add "last_func_repeats" to struct trace_array tracing: Define new ftrace event "func_repeats" tracing: Define static void trace_print_time() ftrace: Simplify the calculation of page number for ftrace_page->records some more ftrace: Store the order of pages allocated in ftrace_page tracing: Remove unused argument from "ring_buffer_time_stamp() tracing: Remove duplicate struct declaration in trace_events.h tracing: Update create_system_filter() kernel-doc comment tracing: A minor cleanup for create_system_filter() kernel: trace: Mundane typo fixes in the file trace_events_filter.c tracing: Fix various typos in comments scripts/recordmcount.pl: Make vim and emacs indent the same scripts/recordmcount.pl: Make indent spacing consistent tracing: Add a verifier to check string pointers for trace events ...
2021-04-30mm: move mem_init_print_info() into mm_init()Kefeng Wang1-0/+1
mem_init_print_info() is called in mem_init() on each architecture, and pass NULL argument, so using void argument and move it into mm_init(). Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86] Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64] Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Guo Ren <guoren@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Peter Zijlstra" <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30mm: HUGE_VMAP arch support cleanupNicholas Piggin1-1/+0
This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-08stack: Optionally randomize kernel stack offset each syscallKees Cook1-0/+23
This provides the ability for architectures to enable kernel stack base address offset randomization. This feature is controlled by the boot param "randomize_kstack_offset=on/off", with its default value set by CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. This feature is based on the original idea from the last public release of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt All the credit for the original idea goes to the PaX team. Note that the design and implementation of this upstream randomize_kstack_offset feature differs greatly from the RANDKSTACK feature (see below). Reasoning for the feature: This feature aims to make harder the various stack-based attacks that rely on deterministic stack structure. We have had many such attacks in past (just to name few): https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html As Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info, STACKLEAK), attackers have had to find new ways for their exploits to work. They have done so, continuing to rely on the kernel's stack determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not relevant. For example, the following recent attacks would have been hampered if the stack offset was non-deterministic between syscalls: https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf (page 70: targeting the pt_regs copy with linear stack overflow) https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html (leaked stack address from one syscall as a target during next syscall) The main idea is that since the stack offset is randomized on each system call, it is harder for an attack to reliably land in any particular place on the thread stack, even with address exposures, as the stack base will change on the next syscall. Also, since randomization is performed after placing pt_regs, the ptrace-based approach[1] to discover the randomized offset during a long-running syscall should not be possible. Design description: During most of the kernel's execution, it runs on the "thread stack", which is pretty deterministic in its structure: it is fixed in size, and on every entry from userspace to kernel on a syscall the thread stack starts construction from an address fetched from the per-cpu cpu_current_top_of_stack variable. The first element to be pushed to the thread stack is the pt_regs struct that stores all required CPU registers and syscall parameters. Finally the specific syscall function is called, with the stack being used as the kernel executes the resulting request. The goal of randomize_kstack_offset feature is to add a random offset after the pt_regs has been pushed to the stack and before the rest of the thread stack is used during the syscall processing, and to change it every time a process issues a syscall. The source of randomness is currently architecture-defined (but x86 is using the low byte of rdtsc()). Future improvements for different entropy sources is possible, but out of scope for this patch. Further more, to add more unpredictability, new offsets are chosen at the end of syscalls (the timing of which should be less easy to measure from userspace than at syscall entry time), and stored in a per-CPU variable, so that the life of the value does not stay explicitly tied to a single task. As suggested by Andy Lutomirski, the offset is added using alloca() and an empty asm() statement with an output constraint, since it avoids changes to assembly syscall entry code, to the unwinder, and provides correct stack alignment as defined by the compiler. In order to make this available by default with zero performance impact for those that don't want it, it is boot-time selectable with static branches. This way, if the overhead is not wanted, it can just be left turned off with no performance impact. The generated assembly for x86_64 with GCC looks like this: ... ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax # 12380 <kstack_offset> ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax ffffffff81003983: 48 83 c0 0f add $0xf,%rax ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax ffffffff8100398c: 48 29 c4 sub %rax,%rsp ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax ... As a result of the above stack alignment, this patch introduces about 5 bits of randomness after pt_regs is spilled to the thread stack on x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for stack alignment). The amount of entropy could be adjusted based on how much of the stack space we wish to trade for security. My measure of syscall performance overhead (on x86_64): lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null randomize_kstack_offset=y Simple syscall: 0.7082 microseconds randomize_kstack_offset=n Simple syscall: 0.7016 microseconds So, roughly 0.9% overhead growth for a no-op syscall, which is very manageable. And for people that don't want this, it's off by default. There are two gotchas with using the alloca() trick. First, compilers that have Stack Clash protection (-fstack-clash-protection) enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to any dynamic stack allocations. While the randomization offset is always less than a page, the resulting assembly would still contain (unreachable!) probing routines, bloating the resulting assembly. To avoid this, -fno-stack-clash-protection is unconditionally added to the kernel Makefile since this is the only dynamic stack allocation in the kernel (now that VLAs have been removed) and it is provably safe from Stack Clash style attacks. The second gotcha with alloca() is a negative interaction with -fstack-protector*, in that it sees the alloca() as an array allocation, which triggers the unconditional addition of the stack canary function pre/post-amble which slows down syscalls regardless of the static branch. In order to avoid adding this unneeded check and its associated performance impact, architectures need to carefully remove uses of -fstack-protector-strong (or -fstack-protector) in the compilation units that use the add_random_kstack() macro and to audit the resulting stack mitigation coverage (to make sure no desired coverage disappears). No change is visible for this on x86 because the stack protector is already unconditionally disabled for the compilation unit, but the change is required on arm64. There is, unfortunately, no attribute that can be used to disable stack protector for specific functions. Comparison to PaX RANDKSTACK feature: The RANDKSTACK feature randomizes the location of the stack start (cpu_current_top_of_stack), i.e. including the location of pt_regs structure itself on the stack. Initially this patch followed the same approach, but during the recent discussions[2], it has been determined to be of a little value since, if ptrace functionality is available for an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write different offsets in the pt_regs struct, observe the cache behavior of the pt_regs accesses, and figure out the random stack offset. Another difference is that the random offset is stored in a per-cpu variable, rather than having it be per-thread. As a result, these implementations differ a fair bit in their implementation details and results, though obviously the intent is similar. [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/ [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/ [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html Co-developed-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
2021-03-18bootconfig: Update prototype of setup_boot_config()Cao jin1-3/+3
Parameter "cmdline" has no use, drop it. Link: https://lkml.kernel.org/r/20210311085213.27680-1-jojing64@gmail.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Cao jin <jojing64@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-26kgdb: fix to kill breakpoints on initmem after bootSumit Garg1-0/+1
Currently breakpoints in kernel .init.text section are not handled correctly while allowing to remove them even after corresponding pages have been freed. Fix it via killing .init.text section breakpoints just prior to initmem pages being freed. Doug: "HW breakpoints aren't handled by this patch but it's probably not such a big deal". Link: https://lkml.kernel.org/r/20210224081652.587785-1-sumit.garg@linaro.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Suggested-by: Doug Anderson <dianders@chromium.org> Acked-by: Doug Anderson <dianders@chromium.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Tested-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26lib: stackdepot: add support to disable stack depotVijayanand Jitta1-0/+2
Add a kernel parameter stack_depot_disable to disable stack depot. So that stack hash table doesn't consume any memory when stack depot is disabled. The use case is CONFIG_PAGE_OWNER without page_owner=on. Without this patch, stackdepot will consume the memory for the hashtable. By default, it's 8M which is never trivial. With this option, in CONFIG_PAGE_OWNER configured system, page_owner=off, stack_depot_disable in kernel command line, we could save the wasted memory for the hashtable. [akpm@linux-foundation.org: fix CONFIG_STACKDEPOT=n build] Link: https://lkml.kernel.org/r/1611749198-24316-2-git-send-email-vjitta@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Alexander Potapenko <glider@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yogesh Lal <ylal@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26mm: add Kernel Electric-Fence infrastructureAlexander Potapenko1-0/+3
Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7. This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. This series enables KFENCE for the x86 and arm64 architectures, and adds KFENCE hooks to the SLAB and SLUB allocators. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, the next allocation through the main allocator (SLAB or SLUB) returns a guarded allocation from the KFENCE object pool. At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. The KFENCE memory pool is of fixed size, and if the pool is exhausted no further KFENCE allocations occur. The default config is conservative with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB pages). We have verified by running synthetic benchmarks (sysbench I/O, hackbench) and production server-workload benchmarks that a kernel with KFENCE (using sample intervals 100-500ms) is performance-neutral compared to a non-KFENCE baseline kernel. KFENCE is inspired by GWP-ASan [1], a userspace tool with similar properties. The name "KFENCE" is a homage to the Electric Fence Malloc Debugger [2]. For more details, see Documentation/dev-tools/kfence.rst added in the series -- also viewable here: https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst [1] http://llvm.org/docs/GwpAsan.html [2] https://linux.die.net/man/3/efence This patch (of 9): This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors. KFENCE is designed to be enabled in production kernels, and has near zero performance overhead. Compared to KASAN, KFENCE trades performance for precision. The main motivation behind KFENCE's design, is that with enough total uptime KFENCE will detect bugs in code paths not typically exercised by non-production test workloads. One way to quickly achieve a large enough total uptime is when the tool is deployed across a large fleet of machines. KFENCE objects each reside on a dedicated page, at either the left or right page boundaries. The pages to the left and right of the object page are "guard pages", whose attributes are changed to a protected state, and cause page faults on any attempted access to them. Such page faults are then intercepted by KFENCE, which handles the fault gracefully by reporting a memory access error. To detect out-of-bounds writes to memory within the object's page itself, KFENCE also uses pattern-based redzones. The following figure illustrates the page layout: ---+-----------+-----------+-----------+-----------+-----------+--- | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx | | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx | | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x | | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx | | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx | | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx | ---+-----------+-----------+-----------+-----------+-----------+--- Guarded allocations are set up based on a sample interval (can be set via kfence.sample_interval). After expiration of the sample interval, a guarded allocation from the KFENCE object pool is returned to the main allocator (SLAB or SLUB). At this point, the timer is reset, and the next allocation is set up after the expiration of the interval. To enable/disable a KFENCE allocation through the main allocator's fast-path without overhead, KFENCE relies on static branches via the static keys infrastructure. The static branch is toggled to redirect the allocation to KFENCE. To date, we have verified by running synthetic benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE is performance-neutral compared to the non-KFENCE baseline. For more details, see Documentation/dev-tools/kfence.rst (added later in the series). [elver@google.com: fix parameter description for kfence_object_start()] Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com [elver@google.com: avoid stalling work queue task without allocations] Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com [elver@google.com: fix potential deadlock due to wake_up()] Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com [elver@google.com: add option to use KFENCE without static keys] Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com [elver@google.com: add missing copyright and description headers] Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: SeongJae Park <sjpark@amazon.de> Co-developed-by: Marco Elver <elver@google.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Joern Engel <joern@purestorage.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-15sfi: Remove framework for deprecated firmwareAndy Shevchenko1-2/+0
SFI-based platforms are gone. So does this framework. This removes mention of SFI through the drivers and other code as well. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-05init/gcov: allow CONFIG_CONSTRUCTORS on UML to fix module gcovJohannes Berg1-1/+7
On ARCH=um, loading a module doesn't result in its constructors getting called, which breaks module gcov since the debugfs files are never registered. On the other hand, in-kernel constructors have already been called by the dynamic linker, so we can't call them again. Get out of this conundrum by allowing CONFIG_CONSTRUCTORS to be selected, but avoiding the in-kernel constructor calls. Also remove the "if !UML" from GCOV selecting CONSTRUCTORS now, since we really do want CONSTRUCTORS, just not kernel binary ones. Link: https://lkml.kernel.org/r/20210120172041.c246a2cac2fb.I1358f584b76f1898373adfed77f4462c8705b736@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jessica Yu <jeyu@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-08Revert "init/console: Use ttynull as a fallback when there is no console"Petr Mladek1-8/+2
This reverts commit 757055ae8dedf5333af17b3b5b4b70ba9bc9da4e. The commit caused that ttynull was used as the default console on several systems[1][2][3]. As a result, the console was blank even when a better alternative existed. It happened when there was no console configured on the command line and ttynull_init() was the first initcall calling register_console(). Or it happened when /dev/ did not exist when console_on_rootfs() was called. It was not able to open /dev/console even though a console driver was registered. It tried to add ttynull console but it obviously did not help. But ttynull became the preferred console and was used by /dev/console when it was available later. The commit tried to fix a historical problem that have been there for ages. The primary motivation was the commit 3cffa06aeef7ece30f6 ("printk/console: Allow to disable console output by using console="" or console=null"). It provided a clean solution for a workaround that was widely used and worked only by chance. This revert causes that the console="" or console=null command line options will again work only by chance. These options will cause that a particular console will be preferred and the default (tty) ones will not get enabled. There will be no console registered at all. As a result there won't be stdin, stdout, and stderr for the init process. But it worked exactly this way even before. The proper solution has to fulfill many conditions: + Register ttynull only when explicitly required or as the ultimate fallback. + ttynull should get associated with /dev/console but it must not become preferred console when used as a fallback. Especially, it must still be possible to replace it by a better console later. Such a change requires clean up of the register_console() code. Otherwise, it would be even harder to follow. Especially, the use of has_preferred_console and CON_CONSDEV flag is tricky. The clean up is risky. The ordering of consoles is not well defined. And any changes tend to break existing user settings. Do the revert at the least risky solution for now. [1] https://lore.kernel.org/linux-kselftest/20201221144302.GR4077@smile.fi.intel.com/ [2] https://lore.kernel.org/lkml/d2a3b3c0-e548-7dd1-730f-59bc5c04e191@synopsys.com/ [3] https://patchwork.ozlabs.org/project/linux-um/patch/20210105120128.10854-1-thomas@m3y3r.de/ Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reported-by: Vineet Gupta <vgupta@synopsys.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-04Merge branch 'rcu/urgent' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This is a fix for a regression in the v5.10 merge window, but it was reported quite late in the v5.10 process, plus generating and testing the fix took some time. The regression is due to commit 36dadef23fcc ("kprobes: Init kprobes in early_initcall") which on powerpc can use RCU Tasks before initialization, resulting in boot failures. The fix is straightforward, simply moving initialization of RCU Tasks before the early_initcall()s. The fix has been exposed to -next and kbuild test robot testing, and has been tested by the PowerPC guys" * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu-tasks: Move RCU-tasks initialization to before early_initcall()
2020-12-16Merge tag 'printk-for-5.11' of ↵Linus Torvalds1-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Finally allow parallel writes and reads into/from the lockless ringbuffer. But it is not a complete solution. Readers are still serialized against each other. And nested writes are still prevented by printk_safe per-CPU buffers. - Use ttynull as the ultimate fallback for /dev/console. - Officially allow disabling console output by using console="" or console=null - A few code cleanups * tag 'printk-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: remove logbuf_lock writer-protection of ringbuffer printk: inline log_output(),log_store() in vprintk_store() printk: remove obsolete dead assignment printk/console: Allow to disable console output by using console="" or console=null init/console: Use ttynull as a fallback when there is no console printk: ringbuffer: Reference text_data_ring directly in callees.
2020-12-15mm, page_alloc: do not rely on the order of page_poison and ↵Vlastimil Babka1-1/+1
init_on_alloc/free parameters Patch series "cleanup page poisoning", v3. I have identified a number of issues and opportunities for cleanup with CONFIG_PAGE_POISON and friends: - interaction with init_on_alloc and init_on_free parameters depends on the order of parameters (Patch 1) - the boot time enabling uses static key, but inefficienty (Patch 2) - sanity checking is incompatible with hibernation (Patch 3) - CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have init_on_free (Patch 4) - CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we have init_on_free (Patch 5) This patch (of 5): Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1 produces a warning in dmesg that page_poison takes precedence. However, as these warnings are printed in early_param handlers for init_on_alloc/free, they are not printed if page_poison is enabled later on the command line (handlers are called in the order of their parameters), or when init_on_alloc/free is always enabled by the respective config option - before the page_poison early param handler is called, it is not considered to be enabled. This is inconsistent. We can remove the dependency on order by making the init_on_* parameters only set a boolean variable, and postponing the evaluation after all early params have been processed. Introduce a new init_mem_debugging_and_hardening() function for that, and move the related debug_pagealloc processing there as well. As a result init_mem_debugging_and_hardening() knows always accurately if init_on_* and/or page_poison options were enabled. Thus we can also optimize want_init_on_alloc() and want_init_on_free(). We don't need to check page_poisoning_enabled() there, we can instead not enable the init_on_* static keys at all, if page poisoning is enabled. This results in a simpler and more effective code. Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mateusz Nosek <mateusznosek0@gmail.com> Cc: Laura Abbott <labbott@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT setLin Feng1-2/+0
In the booting phase if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set, we have following callchain: start_kernel ... mm_init mem_init memblock_free_all reset_all_zones_managed_pages free_low_memory_core_early ... buffer_init nr_free_buffer_pages zone->managed_pages ... rest_init kernel_init kernel_init_freeable page_alloc_init_late kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid); wait_for_completion(&pgdat_init_all_done_comp); ... files_maxfiles_init It's clear that buffer_init depends on zone->managed_pages, but it's reset in reset_all_zones_managed_pages after that pages are readded into zone->managed_pages, but when buffer_init runs this process is half done and most of them will finally be added till deferred_init_memmap done. In large memory couting of nr_free_buffer_pages drifts too much, also drifting from kernels to kernels on same hardware. Fix is simple, it delays buffer_init run till deferred_init_memmap all done. But as corrected by this patch, max_buffer_heads becomes very large, the value is roughly as many as 4 times of totalram_pages, formula: max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct buffer_head)); Say in a 64GB memory box we have 16777216 pages, then max_buffer_heads turns out to be roughly 67,108,864. In common cases, should a buffer_head be mapped to one page/block(4KB)? So max_buffer_heads never exceeds totalram_pages. IMO it's likely to make buffer_heads_over_limit bool value alwasy false, then make codes 'if (buffer_heads_over_limit)' test in vmscan unnecessary. So this patch will change the original behavior related to buffer_heads_over_limit in vmscan since we used a half done value of zone->managed_pages before, or should we use a smaller factor(<10%) in previous formula. akpm: I think this is OK - the max_buffer_heads code is only needed on highmem machines, to prevent ZONE_NORMAL from being consumed by large amounts of buffer_heads attached to highmem pagecache. This problem will not occur on 64-bit machines, so this feature's non-functionality on such machines is a feature, not a bug. Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com Signed-off-by: Lin Feng <linf@wangsu.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: fix page_owner initializing issue for arm32Zhenhua Huang1-0/+2
Page owner of pages used by page owner itself used is missing on arm32 targets. The reason is dummy_handle and failure_handle is not initialized correctly. Buddy allocator is used to initialize these two handles. However, buddy allocator is not ready when page owner calls it. This change fixed that by initializing page owner after buddy initialization. The working flow before and after this change are: original logic: 1. allocated memory for page_ext(using memblock). 2. invoke the init callback of page_ext_ops like page_owner(using buddy allocator). 3. initialize buddy. after this change: 1. allocated memory for page_ext(using memblock). 2. initialize buddy. 3. invoke the init callback of page_ext_ops like page_owner(using buddy allocator). with the change, failure/dummy_handle can get its correct value and page owner output for example has the one for page owner itself: Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0() init_page_owner+0x28/0x2f8 invoke_init_callbacks_flatmem+0x24/0x34 start_kernel+0x33c/0x5d8 Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-14rcu-tasks: Move RCU-tasks initialization to before early_initcall()Uladzislau Rezki (Sony)1-0/+1
PowerPC testing encountered boot failures due to RCU Tasks not being fully initialized until core_initcall() time. This commit therefore initializes RCU Tasks (along with Rude RCU and RCU Tasks Trace) just before early_initcall() time, thus allowing waiting on RCU Tasks grace periods from early_initcall() handlers. Link: https://lore.kernel.org/rcu/87eekfh80a.fsf@dja-thinkpad.axtens.net/ Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall") Tested-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-14Merge branch 'for-5.11-null-console' into for-linusPetr Mladek1-2/+8
2020-11-30bootconfig: Load size and checksum in the footer as le32Masami Hiramatsu1-2/+2
Load the size and the checksum fields in the footer as le32 instead of u32. This will allow us to apply bootconfig to the cross build initrd without caring the endianness. Link: https://lkml.kernel.org/r/160583934457.547349.10504070298990791074.stgit@devnote2 Reported-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-20init/console: Use ttynull as a fallback when there is no consolePetr Mladek1-2/+8
stdin, stdout, and stderr standard I/O stream are created for the init process. They are not available when there is no console registered for /dev/console. It might lead to a crash when the init process tries to use them, see the commit 48021f98130880dd742 ("printk: handle blank console arguments passed in."). Normally, ttySX and ttyX consoles are used as a fallback when no consoles are defined via the command line, device tree, or SPCR. But there will be no console registered when an invalid console name is configured or when the configured consoles do not exist on the system. Users even try to avoid the console intentionally, for example, by using console="" or console=null. It is used on production systems where the serial port or terminal are not visible to users. Pushing messages to these consoles would just unnecessary slowdown the system. Make sure that stdin, stdout, stderr, and /dev/console are always available by a fallback to the existing ttynull driver. It has been implemented for exactly this purpose but it was used only when explicitly configured. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20201111135450.11214-2-pmladek@suse.com
2020-11-12bootconfig: Extend the magic check range to the preceding 3 bytesMasami Hiramatsu1-2/+12
Since Grub may align the size of initrd to 4 if user pass initrd from cpio, we have to check the preceding 3 bytes as well. Link: https://lkml.kernel.org/r/160520205132.303174.4876760192433315429.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 85c46b78da58 ("bootconfig: Add bootconfig magic word for indicating bootconfig explicitly") Reported-by: Chen Yu <yu.chen.surf@gmail.com> Tested-by: Chen Yu <yu.chen.surf@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-18Merge tag 'linux-kselftest-kunit-5.10-rc1' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull more Kunit updates from Shuah Khan: - add Kunit to kernel_init() and remove KUnit from init calls entirely. This addresses the concern that Kunit would not work correctly during late init phase. - add a linker section where KUnit can put references to its test suites. This is the first step in transitioning to dispatching all KUnit tests from a centralized executor rather than having each as its own separate late_initcall. - add a centralized executor to dispatch tests rather than relying on late_initcall to schedule each test suite separately. Centralized execution is for built-in tests only; modules will execute tests when loaded. - convert bitfield test to use KUnit framework - Documentation updates for naming guidelines and how kunit_test_suite() works. - add test plan to KUnit TAP format * tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE lib: kunit: add bitfield test conversion to KUnit Documentation: kunit: add a brief blurb about kunit_test_suite kunit: test: add test plan to KUnit TAP format init: main: add KUnit to kernel init kunit: test: create a single centralized executor for all tests vmlinux.lds.h: add linker section for KUnit test suites Documentation: kunit: Add naming guidelines
2020-10-15Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "The latest advances in computer science from the trivial queue" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: xtensa: fix Kconfig typo spelling.txt: Remove some duplicate entries mtd: rawnand: oxnas: cleanup/simplify code selftests: vm: add fragment CONFIG_GUP_BENCHMARK perf: Fix opt help text for --no-bpf-event HID: logitech-dj: Fix spelling in comment bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG MAINTAINERS: rectify MMP SUPPORT after moving cputype.h scif: Fix spelling of EACCES printk: fix global comment lib/bitmap.c: fix spello fs: Fix missing 'bit' in comment
2020-10-09init: main: add KUnit to kernel initBrendan Higgins1-0/+4
Although we have not seen any actual examples where KUnit doesn't work because it runs in the late init phase of the kernel, it has been a concern for some time that this could potentially be an issue in the future. So, remove KUnit from init calls entirely, instead call directly from kernel_init() so that KUnit runs after late init. Co-developed-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Brendan Higgins <brendanhiggins@google.com> Reviewed-by: Stephen Boyd <sboyd@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-09-18bootconfig: init: make xbc_namebuf staticJason Yan1-1/+1
This eliminates the following sparse warning: init/main.c:306:6: warning: symbol 'xbc_namebuf' was not declared. Should it be static? Link: https://lkml.kernel.org/r/20200915070324.2239473-1-yanaijie@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18kprobes: tracing/kprobes: Fix to kill kprobes on initmem after bootMasami Hiramatsu1-0/+2
Since kprobe_event= cmdline option allows user to put kprobes on the functions in initmem, kprobe has to make such probes gone after boot. Currently the probes on the init functions in modules will be handled by module callback, but the kernel init text isn't handled. Without this, kprobes may access non-exist text area to disable or remove it. Link: https://lkml.kernel.org/r/159972810544.428528.1839307531600646955.stgit@devnote2 Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-01bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIGShaokun Zhang1-1/+1
Fix up one typo: CONFIG_BOOTCONFIG -> CONFIG_BOOT_CONFIG Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-08-07Merge tag 'trace-v5.9' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - The biggest news in that the tracing ring buffer can now time events that interrupted other ring buffer events. Before this change, if an interrupt came in while recording another event, and that interrupt also had an event, those events would all have the same time stamp as the event it interrupted. Now, with the new design, those events will have a unique time stamp and rightfully display the time for those events that were recorded while interrupting another event. - Bootconfig how has an "override" operator that lets the users have a default config, but then add options to override the default. - A fix was made to properly filter function graph tracing to the ftrace PIDs. This came in at the end of the -rc cycle, and needs to be backported. - Several clean ups, performance updates, and minor fixes as well. * tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits) tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE tracing: Use trace_sched_process_free() instead of exit() for pid tracing bootconfig: Fix to find the initargs correctly Documentation: bootconfig: Add bootconfig override operator tools/bootconfig: Add testcases for value override operator lib/bootconfig: Add override operator support kprobes: Remove show_registers() function prototype tracing/uprobe: Remove dead code in trace_uprobe_register() kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler ftrace: Fix ftrace_trace_task return value tracepoint: Use __used attribute definitions from compiler_attributes.h tracepoint: Mark __tracepoint_string's __used trace : Have tracing buffer info use kvzalloc instead of kzalloc tracing: Remove outdated comment in stack handling ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines ftrace: Setup correct FTRACE_FL_REGS flags for module tracing/hwlat: Honor the tracing_cpumask tracing/hwlat: Drop the duplicate assignment in start_kthread() tracing: Save one trace_event->type by using __TRACE_LAST_TYPE ...
2020-08-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge misc updates from Andrew Morton: - a few MM hotfixes - kthread, tools, scripts, ntfs and ocfs2 - some of MM Subsystems affected by this patch series: kthread, tools, scripts, ntfs, ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore, sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) mm: vmscan: consistent update to pgrefill mm/vmscan.c: fix typo khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: retract_page_tables() remember to test exit khugepaged: collapse_pte_mapped_thp() protect the pmd lock khugepaged: collapse_pte_mapped_thp() flush the right range mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible mm: thp: replace HTTP links with HTTPS ones mm/page_alloc: fix memalloc_nocma_{save/restore} APIs mm/page_alloc.c: skip setting nodemask when we are in interrupt mm/page_alloc: fallbacks at most has 3 elements mm/page_alloc: silence a KASAN false positive mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask() mm/page_alloc.c: simplify pageblock bitmap access mm/page_alloc.c: extract the common part in pfn_to_bitidx() mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits mm/shuffle: remove dynamic reconfiguration mm/memory_hotplug: document why shuffle_zone() is relevant mm/page_alloc: remove nr_free_pagecache_pages() mm: remove vm_total_pages ...
2020-08-07kasan, arm64: don't instrument functions that enable kasanAndrey Konovalov1-1/+1
This patch prepares Software Tag-Based KASAN for stack tagging support. With stack tagging enabled, KASAN tags stack variable in each function in its prologue. In start_kernel() stack variables get tagged before KASAN is enabled via setup_arch()->kasan_init(). As the result the tags for start_kernel()'s stack variables end up in the temporary shadow memory. Later when KASAN gets enabled, switched to normal shadow, and starts checking tags, this leads to false-positive reports, as proper tags are missing in normal shadow. Disable KASAN instrumentation for start_kernel(). Also disable it for arm64's setup_arch() as a precaution (it doesn't have any stack variables right now). [andreyknvl@google.com: reorder attributes for start_kernel()] Link: http://lkml.kernel.org/r/26fb6165a17abcf61222eda5184c030fb6b133d1.1596544734.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Elena Petrova <lenaptr@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Cc: Ard Biesheuvel <ardb@kernel.org> Link: http://lkml.kernel.org/r/55d432671a92e931ab8234b03dc36b14d4c21bfb.1596199677.git.andreyknvl@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07Merge branch 'hch.init_path' of ↵Linus Torvalds1-14/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull init and set_fs() cleanups from Al Viro: "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series" * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits) init: add an init_dup helper init: add an init_utimes helper init: add an init_stat helper init: add an init_mknod helper init: add an init_mkdir helper init: add an init_symlink helper init: add an init_link helper init: add an init_eaccess helper init: add an init_chmod helper init: add an init_chown helper init: add an init_chroot helper init: add an init_chdir helper init: add an init_rmdir helper init: add an init_unlink helper init: add an init_umount helper init: add an init_mount helper init: mark create_dev as __init init: mark console_on_rootfs as __init init: initialize ramdisk_execute_command at compile time devtmpfs: refactor devtmpfsd() ...
2020-08-04init: add an init_dup helperChristoph Hellwig1-4/+4
Add a simple helper to grab a reference to a file and install it at the next available fd, and switch the early init code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-04Merge branch 'exec-linus' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "During the development of v5.7 I ran into bugs and quality of implementation issues related to exec that could not be easily fixed because of the way exec is implemented. So I have been diggin into exec and cleaning up what I can. This cycle I have been looking at different ideas and different implementations to see what is possible to improve exec, and cleaning the way exec interfaces with in kernel users. Only cleaning up the interfaces of exec with rest of the kernel has managed to stabalize and make it through review in time for v5.9-rc1 resulting in 2 sets of changes this cycle. - Implement kernel_execve - Make the user mode driver code a better citizen With kernel_execve the code size got a little larger as the copying of parameters from userspace and copying of parameters from userspace is now separate. The good news is kernel threads no longer need to play games with set_fs to use exec. Which when combined with the rest of Christophs set_fs changes should security bugs with set_fs much more difficult" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits) exec: Implement kernel_execve exec: Factor bprm_stack_limits out of prepare_arg_pages exec: Factor bprm_execve out of do_execve_common exec: Move bprm_mm_init into alloc_bprm exec: Move initialization of bprm->filename into alloc_bprm exec: Factor out alloc_bprm exec: Remove unnecessary spaces from binfmts.h umd: Stop using split_argv umd: Remove exit_umh bpfilter: Take advantage of the facilities of struct pid exit: Factor thread_group_exited out of pidfd_poll umd: Track user space drivers with struct pid bpfilter: Move bpfilter_umh back into init data exec: Remove do_execve_file umh: Stop calling do_execve_file umd: Transform fork_usermode_blob into fork_usermode_driver umd: Rename umd_info.cmdline umd_info.driver_name umd: For clarity rename umh_info umd_info umh: Separate the user mode driver and the user mode helper support umh: Remove call_usermodehelper_setup_file. ...
2020-08-04bootconfig: Fix to find the initargs correctlyMasami Hiramatsu1-6/+8
Since the parse_args() stops parsing at '--', bootconfig_params() will never get the '--' as param and initargs_found never be true. In the result, if we pass some init arguments via the bootconfig, those are always appended to the kernel command line with '--' even if the kernel command line already has '--'. To fix this correctly, check the return value of parse_args() and set initargs_found true if the return value is not an error but a valid address. Link: https://lkml.kernel.org/r/159650953285.270383.14822353843556363851.stgit@devnote2 Fixes: f61872bb58a1 ("bootconfig: Use parse_args() to find bootconfig and '--'") Cc: stable@vger.kernel.org Reported-by: Arvind Sankar <nivedita@alum.mit.edu> Suggested-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-31init: add an init_eaccess helperChristoph Hellwig1-2/+2
Add a simple helper to check if a file exists based on kernel space file name and switch the early init code over to it. Note that this theoretically changes behavior as it always is based on the effective permissions. But during early init that doesn't make a difference. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: mark console_on_rootfs as __initChristoph Hellwig1-1/+1
This helper is only used for the early init code. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: initialize ramdisk_execute_command at compile timeChristoph Hellwig1-5/+1
Set ramdisk_execute_command to "/init" at compile time. The command line can still override it, but this saves a few instructions and removes a NULL check. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31init: open code setting up stdin/stdout/stderrChristoph Hellwig1-6/+10
Don't rely on the implicit set_fs(KERNEL_DS) for ksys_open to work, but instead open a struct file for /dev/console and then install it as FD 0/1/2 manually. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-21exec: Implement kernel_execveEric W. Biederman1-3/+1
To allow the kernel not to play games with set_fs to call exec implement kernel_execve. The function kernel_execve takes pointers into kernel memory and copies the values pointed to onto the new userspace stack. The calls with arguments from kernel space of do_execve are replaced with calls to kernel_execve. The calls do_execve and do_execveat are made static as there are now no callers outside of exec. The comments that mention do_execve are updated to refer to kernel_execve or execve depending on the circumstances. In addition to correcting the comments, this makes it easy to grep for do_execve and verify it is not used. Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-06-16security: allow using Clang's zero initialization for stack variablesglider@google.com1-5/+7
In addition to -ftrivial-auto-var-init=pattern (used by CONFIG_INIT_STACK_ALL now) Clang also supports zero initialization for locals enabled by -ftrivial-auto-var-init=zero. The future of this flag is still being debated (see https://bugs.llvm.org/show_bug.cgi?id=45497). Right now it is guarded by another flag, -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang, which means it may not be supported by future Clang releases. Another possible resolution is that -ftrivial-auto-var-init=zero will persist (as certain users have already started depending on it), but the name of the guard flag will change. In the meantime, zero initialization has proven itself as a good production mitigation measure against uninitialized locals. Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes. On the other hand, pattern initialization remains safer for return values. Chrome OS and Android are moving to using zero initialization for production builds. Performance-wise, the difference between pattern and zero initialization is usually negligible, although the generated code for zero initialization is more compact. This patch renames CONFIG_INIT_STACK_ALL to CONFIG_INIT_STACK_ALL_PATTERN and introduces another config option, CONFIG_INIT_STACK_ALL_ZERO, that enables zero initialization for locals if the corresponding flags are supported by Clang. Cc: Kees Cook <keescook@chromium.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20200616083435.223038-1-glider@google.com Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-11Rebase locking/kcsan to locking/urgentThomas Gleixner1-0/+2
Merge the state of the locking kcsan branch before the read/write_once() and the atomics modifications got merged. Squash the fallout of the rebase on top of the read/write once and atomic fallback work into the merge. The history of the original branch is preserved in tag locking-kcsan-2020-06-02. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-08kernel/sysctl: support setting sysctl parameters from kernel command lineVlastimil Babka1-0/+2
Patch series "support setting sysctl parameters from kernel command line", v3. This series adds support for something that seems like many people always wanted but nobody added it yet, so here's the ability to set sysctl parameters via kernel command line options in the form of sysctl.vm.something=1 The important part is Patch 1. The second, not so important part is an attempt to clean up legacy one-off parameters that do the same thing as a sysctl. I don't want to remove them completely for compatibility reasons, but with generic sysctl support the idea is to remove the one-off param handlers and treat the parameters as aliases for the sysctl variants. I have identified several parameters that mention sysctl counterparts in Documentation/admin-guide/kernel-parameters.txt but there might be more. The conversion also has varying level of success: - numa_zonelist_order is converted in Patch 2 together with adding the necessary infrastructure. It's easy as it doesn't really do anything but warn on deprecated value these days. - hung_task_panic is converted in Patch 3, but there's a downside that now it only accepts 0 and 1, while previously it was any integer value - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic, so there's no straighforward conversion possible - traceoff_on_warning is a flag without value and it would be required to handle that somehow in the conversion infractructure, which seems pointless for a single flag This patch (of 5): A recently proposed patch to add vm_swappiness command line parameter in addition to existing sysctl [1] made me wonder why we don't have a general support for passing sysctl parameters via command line. Googling found only somebody else wondering the same [2], but I haven't found any prior discussion with reasons why not to do this. Settings the vm_swappiness issue aside (the underlying issue might be solved in a different way), quick search of kernel-parameters.txt shows there are already some that exist as both sysctl and kernel parameter - hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning. A general mechanism would remove the need to add more of those one-offs and might be handy in situations where configuration by e.g. /etc/sysctl.d/ is impractical. Hence, this patch adds a new parse_args() pass that looks for parameters prefixed by 'sysctl.' and tries to interpret them as writes to the corresponding sys/ files using an temporary in-kernel procfs mount. This mechanism was suggested by Eric W. Biederman [3], as it handles all dynamically registered sysctl tables, even though we don't handle modular sysctls. Errors due to e.g. invalid parameter name or value are reported in the kernel log. The processing is hooked right before the init process is loaded, as some handlers might be more complicated than simple setters and might need some subsystems to be initialized. At the moment the init process can be started and eventually execute a process writing to /proc/sys/ then it should be also fine to do that from the kernel. Sysctls registered later on module load time are not set by this mechanism - it's expected that in such scenarios, setting sysctl values from userspace is practical enough. [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/ [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04init: allow distribution configuration of default initChris Down1-0/+10
Some init systems (eg. systemd) have init at their own paths, for example, /usr/lib/systemd/systemd. A compatibility symlink to one of the hardcoded init paths is provided by another package, usually named something like systemd-sysvcompat or similar. Currently distro maintainers who are hands-off on the bootloader are more or less required to include those compatibility links as part of their base distribution, because it's hard to migrate away from them since there's a risk some users will not get the message to set init= on the kernel command line appropriately. Moreover, for distributions where the init system is something the distribution itself is opinionated about (eg. Arch, which has systemd in the required `base` package), we could usually reasonably configure this ahead of time when building the distribution kernel. However, we currently simply don't have any way to configure the kernel to do this. Here's an example discussion where removing sysvcompat was discussed by distro maintainers[0]. This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set is tried before the hardcoded fallback list. So the order of precedence is now thus: 1. init= on command line (on failure: panic) 2. CONFIG_DEFAULT_INIT (on failure: try #3) 3. Hardcoded fallback list (on failure: panic) This new config parameter will allow distribution maintainers to move away from these compatibility links safely, without having to worry that their users might not have the right init=. There are also two other benefits of this over having the distribution maintain a symlink: 1. One of the value propositions over simply having distributions maintain a /sbin/init symlink via a package is that it also frees distributions which have a preferred default, but not mandatory, init system from having their package manager fight with their users for control of /{s,}bin/init. Instead, the distribution simply makes their preference known in CONFIG_DEFAULT_INIT, and if the user installs another init system and uninstalls the default one they can still make use of /{s,}bin/init and friends for their own uses. This makes more cases Just Work(tm) without the user having to perform extra configuration via init=. 2. Since before this we don't know which path the distribution actually _intends_ to serve init from, we don't pr_err if it is simply missing, and usually will just silently put the user in a /bin/sh shell. Now that the distribution can make a declaration of intent, we can be more vocal when this init system fails to launch for any reason, even if it's simply because no file exists at that location, speeding up the palaver of init/mount dependency/etc debugging a bit. [0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03padata: initialize earlierDaniel Jordan1-0/+2
padata will soon initialize the system's struct pages in parallel, so it needs to be ready by page_alloc_init_late(). The error return from padata_driver_init() triggers an initcall warning, so add a warning to padata_init() to avoid silent failure. Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Josh Triplett <josh@joshtriplett.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-17Merge tag 'x86_urgent_for_v5.7-rc7' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: "A single fix for early boot crashes of kernels built with gcc10 and stack protector enabled" * tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix early boot crash on gcc-10, third try
2020-05-15x86: Fix early boot crash on gcc-10, third tryx86_urgent_for_v5.7-rc7Borislav Petkov1-0/+2
... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value. The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10: Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139 Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013 Call Trace: dump_stack panic ? start_secondary __stack_chk_fail start_secondary secondary_startup_64 -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call. To fix that, the initial attempt was to mark the one function which generates the stack canary with: __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused) however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer and thus leading to not present frame pointer - frame pointer which the kernel needs. The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm(""). This current solution was short and sweet, and reportedly, is supported by both compilers but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate this, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around etc. That should hold for a long time, but hey we said that about the other two solutions too so... Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Kalle Valo <kvalo@codeaurora.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-05-12bootconfig: Fix to prevent warning message if no bootconfig optionMasami Hiramatsu1-4/+6
Commit de462e5f1071 ("bootconfig: Fix to remove bootconfig data from initrd while boot") causes a cosmetic regression on dmesg, which warns "no bootconfig data" message without bootconfig cmdline option. Fix setup_boot_config() by moving no bootconfig check after commandline option check. Link: http://lkml.kernel.org/r/9b1ba335-071d-c983-89a4-2677b522dcc8@molgen.mpg.de Link: http://lkml.kernel.org/r/158916116468.21787.14558782332170588206.stgit@devnote2 Fixes: de462e5f1071 ("bootconfig: Fix to remove bootconfig data from initrd while boot") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06bootconfig: Fix to remove bootconfig data from initrd while bootMasami Hiramatsu1-17/+52
If there is a bootconfig data in the tail of initrd/initramfs, initrd image sanity check caused an error while decompression stage as follows. [ 0.883882] Unpacking initramfs... [ 2.696429] Initramfs unpacking failed: invalid magic at start of compressed archive This error will be ignored if CONFIG_BLK_DEV_RAM=n, but CONFIG_BLK_DEV_RAM=y the kernel failed to mount rootfs and causes a panic. To fix this issue, shrink down the initrd_end for removing tailing bootconfig data while boot the kernel. Link: http://lkml.kernel.org/r/158788401014.24243.17424755854115077915.stgit@devnote2 Cc: Borislav Petkov <bp@alien8.de> Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: 7684b8582c24 ("bootconfig: Load boot config from the tail of initrd") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-13Merge tag 'v5.7-rc1' into locking/kcsan, to resolve conflicts and refreshIngo Molnar1-5/+10
Resolve these conflicts: arch/x86/Kconfig arch/x86/kernel/Makefile Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines in the Kconfig to reduce the probability of future conflicts. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-10printk: queue wake_up_klogd irq_work only if per-CPU areas are readySergey Senozhatsky1-1/+0
printk_deferred(), similarly to printk_safe/printk_nmi, does not immediately attempt to print a new message on the consoles, avoiding calls into non-reentrant kernel paths, e.g. scheduler or timekeeping, which potentially can deadlock the system. Those printk() flavors, instead, rely on per-CPU flush irq_work to print messages from safer contexts. For same reasons (recursive scheduler or timekeeping calls) printk() uses per-CPU irq_work in order to wake up user space syslog/kmsg readers. However, only printk_safe/printk_nmi do make sure that per-CPU areas have been initialised and that it's safe to modify per-CPU irq_work. This means that, for instance, should printk_deferred() be invoked "too early", that is before per-CPU areas are initialised, printk_deferred() will perform illegal per-CPU access. Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random: silence a lockdep splat with printk()") user-space syslog/kmsg readers are not able to read new kernel messages. The reason is printk_deferred() being called too early (as was pointed out by Petr and John). Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU areas are initialized. Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/ Reported-by: Lech Perczak <l.perczak@camlintechnologies.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by: Jann Horn <jannh@google.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21Merge branch 'linus' into locking/kcsan, to pick up fixesIngo Molnar1-16/+22
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21Merge branch 'x86/kdump' into locking/kcsan, to resolve conflictsIngo Molnar1-27/+263
Conflicts: arch/x86/purgatory/Makefile Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-03tools/bootconfig: Show line and column in parse errorMasami Hiramatsu1-4/+10
Show line and column when we got a parse error in bootconfig tool. Current lib/bootconfig shows the parse error with byte offset, but that is not human readable. This makes xbc_init() not showing error message itself but able to pass the error message and position to caller, so that the caller can decode it and show the error message with line number and columns. With this patch, bootconfig tool shows an error with line:column as below. $ cat samples/bad-dotword.bconf # do not start keyword with . key { .word = 1 } $ ./bootconfig -a samples/bad-dotword.bconf initrd Parse Error: Invalid keyword at 3:3 Link: http://lkml.kernel.org/r/158323469002.10560.4023923847704522760.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-26Merge tag 'trace-v5.6-rc2' of ↵Linus Torvalds1-16/+22
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing and bootconfig updates: "Fixes and changes to bootconfig before it goes live in a release. Change in API of bootconfig (before it comes live in a release): - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists - Set CONFIG_BOOT_CONFIG to 'n' by default - Show error if "bootconfig" on cmdline but not compiled in - Prevent redefining the same value - Have a way to append values - Added a SELECT BLK_DEV_INITRD to fix a build failure Synthetic event fixes: - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is) - Fix samples always recording u64 values - Fix endianess - Check number of values matches number of fields - Fix a printing bug Fix of trace_printk() breaking postponed start up tests Make a function static that is only used in a single file" * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue bootconfig: Add append value operator support bootconfig: Prohibit re-defining value on same key bootconfig: Print array as multiple commands for legacy command line bootconfig: Reject subkey and value on same parent key tools/bootconfig: Remove unneeded error message silencer bootconfig: Add bootconfig magic word for indicating bootconfig explicitly bootconfig: Set CONFIG_BOOT_CONFIG=n by default tracing: Clear trace_state when starting trace bootconfig: Mark boot_config_checksum() static tracing: Disable trace_printk() on post poned tests tracing: Have synthetic event test use raw_smp_processor_id() tracing: Fix number printing bug in print_synth_event() tracing: Check that number of vals matches number of synth event fields tracing: Make synth_event trace functions endian-correct tracing: Make sure synth_event_trace() example always uses u64
2020-02-20bootconfig: Print array as multiple commands for legacy command lineMasami Hiramatsu1-14/+8
Print arraied values as multiple same options for legacy kernel command line. With this rule, if the "kernel.*" and "init.*" array entries in bootconfig are printed out as multiple same options, e.g. kernel { console = "ttyS0,115200" console += "tty0" } will be correctly converted to console="ttyS0,115200" console="tty0" in the kernel command line. Link: http://lkml.kernel.org/r/158220118213.26565.8163300497009463916.stgit@devnote2 Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20bootconfig: Add bootconfig magic word for indicating bootconfig explicitlyMasami Hiramatsu1-1/+5
Add bootconfig magic word to the end of bootconfig on initrd image for indicating explicitly the bootconfig is there. Also tools/bootconfig treats wrong size or wrong checksum or parse error as an error, because if there is a bootconfig magic word, there must be a bootconfig. The bootconfig magic word is "#BOOTCONFIG\n", 12 bytes word. Thus the block image of the initrd file with bootconfig is as follows. [Initrd][bootconfig][size][csum][#BOOTCONFIG\n] Link: http://lkml.kernel.org/r/158220112263.26565.3944814205960612841.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20bootconfig: Set CONFIG_BOOT_CONFIG=n by defaultMasami Hiramatsu1-0/+8
Set CONFIG_BOOT_CONFIG=n by default. This also warns user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given in the kernel command line. Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20bootconfig: Mark boot_config_checksum() staticQiujun Huang1-1/+1
In fact, this function is only used in this file, so mark it with 'static'. Link: http://lkml.kernel.org/r/1581852511-14163-1-git-send-email-hqjagain@gmail.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-11Merge tag 'trace-v5.6-rc1' of ↵Linus Torvalds1-7/+30
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various fixes: - Fix an uninitialized variable - Fix compile bug to bootconfig userspace tool (in tools directory) - Suppress some error messages of bootconfig userspace tool - Remove unneded CONFIG_LIBXBC from bootconfig - Allocate bootconfig xbc_nodes dynamically. To ease complaints about taking up static memory at boot up - Use of parse_args() to parse bootconfig instead of strstr() usage Prevents issues of double quotes containing the interested string - Fix missing ring_buffer_nest_end() on synthetic event error path - Return zero not -EINVAL on soft disabled synthetic event (soft disabling must be the same as hard disabling, which returns zero) - Consolidate synthetic event code (remove duplicate code)" * tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Consolidate trace() functions tracing: Don't return -EINVAL when tracing soft disabled synth events tracing: Add missing nest end to synth_event_trace_start() error case tools/bootconfig: Suppress non-error messages bootconfig: Allocate xbc_nodes array dynamically bootconfig: Use parse_args() to find bootconfig and '--' tracing/kprobe: Fix uninitialized variable bug bootconfig: Remove unneeded CONFIG_LIBXBC tools/bootconfig: Fix wrong __VA_ARGS__ usage
2020-02-10bootconfig: Use parse_args() to find bootconfig and '--'Steven Rostedt (VMware)1-7/+30
The current implementation does a naive search of "bootconfig" on the kernel command line. But this could find "bootconfig" that is part of another option in quotes (although highly unlikely). But it also needs to find '--' on the kernel command line to know if it should append a '--' or not when a bootconfig in the initrd file has an "init" section. The check uses the naive strstr() to find to see if it exists. But this can return a false positive if it exists in an option and then the "init" section in the initrd will not be appended properly. Using parse_args() to find both of these will solve both of these problems. Link: https://lore.kernel.org/r/202002070954.C18E7F58B@keescook Fixes: 7495e0926fdf3 ("bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline") Fixes: 1319916209ce8 ("bootconfig: init: Allow admin to use bootconfig for init command line") Reported-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-06Merge tag 'trace-v5.6-2' of ↵Linus Torvalds1-17/+212
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added new "bootconfig". This looks for a file appended to initrd to add boot config options, and has been discussed thoroughly at Linux Plumbers. Very useful for adding kprobes at bootup. Only enabled if "bootconfig" is on the real kernel command line. - Created dynamic event creation. Merges common code between creating synthetic events and kprobe events. - Rename perf "ring_buffer" structure to "perf_buffer" - Rename ftrace "ring_buffer" structure to "trace_buffer" Had to rename existing "trace_buffer" to "array_buffer" - Allow trace_printk() to work withing (some) tracing code. - Sort of tracing configs to be a little better organized - Fixed bug where ftrace_graph hash was not being protected properly - Various other small fixes and clean ups * tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits) bootconfig: Show the number of nodes on boot message tools/bootconfig: Show the number of bootconfig nodes bootconfig: Add more parse error messages bootconfig: Use bootconfig instead of boot config ftrace: Protect ftrace_graph_hash with ftrace_sync ftrace: Add comment to why rcu_dereference_sched() is open coded tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu tracing: Annotate ftrace_graph_hash pointer with __rcu bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline tracing: Use seq_buf for building dynevent_cmd string tracing: Remove useless code in dynevent_arg_pair_add() tracing: Remove check_arg() callbacks from dynevent args tracing: Consolidate some synth_event_trace code tracing: Fix now invalid var_ref_vals assumption in trace action tracing: Change trace_boot to use synth_event interface tracing: Move tracing selftests to bottom of menu tracing: Move mmio tracer config up with the other tracers tracing: Move tracing test module configs together tracing: Move all function tracing configs together tracing: Documentation for in-kernel synthetic event API ...
2020-02-05bootconfig: Show the number of nodes on boot messageMasami Hiramatsu1-2/+4
Show the number of bootconfig nodes on boot message. Link: http://lkml.kernel.org/r/158091062297.27924.9051634676068550285.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05bootconfig: Use bootconfig instead of boot configMasami Hiramatsu1-3/+3
Use "bootconfig" (1 word) instead of "boot config" (2 words) in the boot message. Link: http://lkml.kernel.org/r/158091059459.27924.14414336187441539879.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdlineSteven Rostedt (VMware)1-7/+21
As the bootconfig is appended to the initrd it is not as easy to modify as the kernel command line. If there's some issue with the kernel, and the developer wants to boot a pristine kernel, it should not be needed to modify the initrd to remove the bootconfig for a single boot. As bootconfig is silently added (if the admin does not know where to look they may not know it's being loaded). It should be explicitly added to the kernel cmdline. The loading of the bootconfig is only done if "bootconfig" is on the kernel command line. This will let admins know that the kernel command line is extended. Note, after adding printk()s for when the size is too great or the checksum is wrong, exposed that the current method always looked for the boot config, and if this size and checksum matched, it would parse it (as if either is wrong a printk has been added to show this). It's better to only check this if the boot config is asked to be looked for. Link: https://lore.kernel.org/r/CAHk-=wjfjO+h6bQzrTf=YCZA53Y3EDyAs3Z4gEsT7icA3u_Psw@mail.gmail.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>