| Age | Commit message (Collapse) | Author | Files | Lines |
|
The kernel cmdline length is allowed to be longer than what printk can
handle. When this happens the cmdline that's printed to the kernel ring
buffer at bootup is cutoff and some kernel cmdline options are "hidden"
from the logs. This undercuts the usefulness of the log message.
Specifically, grepping for COMMAND_LINE_SIZE shows that 2048 is common and
some architectures even define it as 4096. s390 allows a CONFIG-based
maximum up to 1MB (though it's not expected that anyone will go over the
default max of 4096 [1]).
The maximum message pr_notice() seems to be able to handle (based on
experiment) is 1021 characters. This appears to be based on the current
value of PRINTKRB_RECORD_MAX as 1024 and the fact that pr_notice() spends
2 characters on the loglevel prefix and we have a '\n' at the end.
While it would be possible to increase the limits of printk() (and
therefore pr_notice()) somewhat, it doesn't appear possible to increase it
enough to fully include a 2048-character cmdline without breaking
userspace. Specifically on at least two tested userspaces (ChromeOS plus
the Debian-based distro I'm typing this message on) the `dmesg` tool reads
lines from `/dev/kmsg` in 2047-byte chunks. As per
`Documentation/ABI/testing/dev-kmsg`:
Every read() from the opened device node receives one record
of the kernel's printk buffer.
...
Messages in the record ring buffer get overwritten as whole,
there are never partial messages received by read().
We simply can't fit a 2048-byte cmdline plus the "Kernel command line:"
prefix plus info about time/log_level/etc in a 2047-byte read.
The above means that if we want to avoid the truncation we need to do some
type of wrapping of the cmdline when printing.
Add wrapping to the printout of the kernel command line. By default, the
wrapping is set to 1021 characters to avoid breaking anyone, but allow
wrapping to be set lower by a Kconfig knob
"CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN". Any tools that are correctly parsing
the cmdline today (because it is less than 1021 characters) will see no
difference in their behavior. The format of wrapped output is designed to
be matched by anyone using "grep" to search for the cmdline and also to be
easy for tools to handle. Anyone who is sure their tools (if any) handle
the wrapped format can choose a lower wrapping value and have prettier
output.
Setting CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN to 0 fully disables the wrapping
logic. This means that long command lines will be truncated again, but
this config could be set if command lines are expected to be long and
userspace is known not to handle parsing logs with the wrapping.
Wrapping is based on spaces, ignoring quotes. All lines are prefixed with
"Kernel command line: " and lines that are not the last line have a " \"
suffix added to them. The prefix and suffix count towards the line length
for wrapping purposes. The ideal length will be exceeded if no
appropriate place to wrap is found.
The wrapping function added here is fairly generic and could be made a
library function (somewhat like print_hex_dump()) if it's needed elsewhere
in the kernel. However, having printk() directly incorporate this
wrapping would be unlikely to be a good idea since it would break
printouts into more than one record without any obvious common line prefix
to tie lines together. It would also be extra overhead when, in general,
kernel log message should simply be kept smaller than 1021 bytes. For
some discussion on this topic, see responses to the v1 posting of this
patch [2].
[akpm@linux-foundation.org: make print_kernel_cmdline __init]
[dianders@chromium.org: v4]
Link: https://lkml.kernel.org/r/20251027082204.v4.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid
Link: https://lkml.kernel.org/r/20251023113257.v3.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid
Link: https://lore.kernel.org/r/20251021131633.26700Dd6-hca@linux.ibm.com [1]
Link: https://lore.kernel.org/r/CAD=FV=VNyt1zG_8pS64wgV8VkZWiWJymnZ-XCfkrfaAhhFSKcA@mail.gmail.com [2]
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Chant <achant@google.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Francesco Valla <francesco@valla.it>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: guoweikang <guoweikang.kernel@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Hendrik Farr <kernel@jfarr.cc>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ida: Remove the ida_simple_xxx() API" from Christophe Jaillet
completes the removal of this legacy IDR API
- "panic: introduce panic status function family" from Jinchao Wang
provides a number of cleanups to the panic code and its various
helpers, which were rather ad-hoc and scattered all over the place
- "tools/delaytop: implement real-time keyboard interaction support"
from Fan Yu adds a few nice user-facing usability changes to the
delaytop monitoring tool
- "efi: Fix EFI boot with kexec handover (KHO)" from Evangelos
Petrongonas fixes a panic which was happening with the combination of
EFI and KHO
- "Squashfs: performance improvement and a sanity check" from Phillip
Lougher teaches squashfs's lseek() about SEEK_DATA/SEEK_HOLE. A mere
150x speedup was measured for a well-chosen microbenchmark
- plus another 50-odd singleton patches all over the place
* tag 'mm-nonmm-stable-2025-10-02-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (75 commits)
Squashfs: reject negative file sizes in squashfs_read_inode()
kallsyms: use kmalloc_array() instead of kmalloc()
MAINTAINERS: update Sibi Sankar's email address
Squashfs: add SEEK_DATA/SEEK_HOLE support
Squashfs: add additional inode sanity checking
lib/genalloc: fix device leak in of_gen_pool_get()
panic: remove CONFIG_PANIC_ON_OOPS_VALUE
ocfs2: fix double free in user_cluster_connect()
checkpatch: suppress strscpy warnings for userspace tools
cramfs: fix incorrect physical page address calculation
kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
Squashfs: fix uninit-value in squashfs_get_parent
kho: only fill kimage if KHO is finalized
ocfs2: avoid extra calls to strlen() after ocfs2_sprintf_system_inode_name()
kernel/sys.c: fix the racy usage of task_lock(tsk->group_leader) in sys_prlimit64() paths
sched/task.h: fix the wrong comment on task_lock() nesting with tasklist_lock
coccinelle: platform_no_drv_owner: handle also built-in drivers
coccinelle: of_table: handle SPI device ID tables
lib/decompress: use designated initializers for struct compress_format
efi: support booting with kexec handover (KHO)
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains a larger set of changes around the generic namespace
infrastructure of the kernel.
Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
ns_common which carries the reference count of the namespace and so
on.
We open-coded and cargo-culted so many quirks for each namespace type
that it just wasn't scalable anymore. So given there's a bunch of new
changes coming in that area I've started cleaning all of this up.
The core change is to make it possible to correctly initialize every
namespace uniformly and derive the correct initialization settings
from the type of the namespace such as namespace operations, namespace
type and so on. This leaves the new ns_common_init() function with a
single parameter which is the specific namespace type which derives
the correct parameters statically. This also means the compiler will
yell as soon as someone does something remotely fishy.
The ns_common_init() addition also allows us to remove ns_alloc_inum()
and drops any special-casing of the initial network namespace in the
network namespace initialization code that Linus complained about.
Another part is reworking the reference counting. The reference
counting was open-coded and copy-pasted for each namespace type even
though they all followed the same rules. This also removes all open
accesses to the reference count and makes it private and only uses a
very small set of dedicated helpers to manipulate them just like we do
for e.g., files.
In addition this generalizes the mount namespace iteration
infrastructure introduced a few cycles ago. As reminder, the vfs makes
it possible to iterate sequentially and bidirectionally through all
mount namespaces on the system or all mount namespaces that the caller
holds privilege over. This allow userspace to iterate over all mounts
in all mount namespaces using the listmount() and statmount() system
call.
Each mount namespace has a unique identifier for the lifetime of the
systems that is exposed to userspace. The network namespace also has a
unique identifier working exactly the same way. This extends the
concept to all other namespace types.
The new nstree type makes it possible to lookup namespaces purely by
their identifier and to walk the namespace list sequentially and
bidirectionally for all namespace types, allowing userspace to iterate
through all namespaces. Looking up namespaces in the namespace tree
works completely locklessly.
This also means we can move the mount namespace onto the generic
infrastructure and remove a bunch of code and members from struct
mnt_namespace itself.
There's a bunch of stuff coming on top of this in the future but for
now this uses the generic namespace tree to extend a concept
introduced first for pidfs a few cycles ago. For a while now we have
supported pidfs file handles for pidfds. This has proven to be very
useful.
This extends the concept to cover namespaces as well. It is possible
to encode and decode namespace file handles using the common
name_to_handle_at() and open_by_handle_at() apis.
As with pidfs file handles, namespace file handles are exhaustive,
meaning it is not required to actually hold a reference to nsfs in
able to decode aka open_by_handle_at() a namespace file handle.
Instead the FD_NSFS_ROOT constant can be passed which will let the
kernel grab a reference to the root of nsfs internally and thus decode
the file handle.
Namespaces file descriptors can already be derived from pidfds which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant if the caller would not have access to an appropriate
/proc/<pid>/ns/ directory as they could always just derive the
namespace based on a pidfd already.
It has the same advantage as pidfds. It's possible to reliably and for
the lifetime of the system refer to a namespace without pinning any
resources and to compare them trivially.
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to they are able to open it otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
The namespace file handle layout is exposed as uapi and has a stable
and extensible format. For now it simply contains the namespace
identifier, the namespace type, and the inode number. The stable
format means that userspace may construct its own namespace file
handles without going through name_to_handle_at() as they are already
allowed for pidfs and cgroup file handles"
* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
ns: drop assert
ns: move ns type into struct ns_common
nstree: make struct ns_tree private
ns: add ns_debug()
ns: simplify ns_common_init() further
cgroup: add missing ns_common include
ns: use inode initializer for initial namespaces
selftests/namespaces: verify initial namespace inode numbers
ns: rename to __ns_ref
nsfs: port to ns_ref_*() helpers
net: port to ns_ref_*() helpers
uts: port to ns_ref_*() helpers
ipv4: use check_net()
net: use check_net()
net-sysfs: use check_net()
user: port to ns_ref_*() helpers
time: port to ns_ref_*() helpers
pid: port to ns_ref_*() helpers
ipc: port to ns_ref_*() helpers
cgroup: port to ns_ref_*() helpers
...
|
|
Support the generic ns lookup infrastructure to support file handles for
namespaces.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
BootLoaders (Grub, LILO, etc) may pass an identifier such as "BOOT_IMAGE=
/boot/vmlinuz-x.y.z" to kernel parameters. But these identifiers are not
recognized by the kernel itself so will be passed to userspace. However
user space init program also don't recognize it.
KEXEC/KDUMP (kexec-tools) may also pass an identifier such as "kexec" on
some architectures.
We cannot change BootLoader's behavior, because this behavior exists for
many years, and there are already user space programs search BOOT_IMAGE=
in /proc/cmdline to obtain the kernel image locations:
https://github.com/linuxdeepin/deepin-ab-recovery/blob/master/util.go
(search getBootOptions)
https://github.com/linuxdeepin/deepin-ab-recovery/blob/master/main.go
(search getKernelReleaseWithBootOption) So the the best way is handle
(ignore) it by the kernel itself, which can avoid such boot warnings (if
we use something like init=/bin/bash, bootloader identifier can even cause
a crash):
Kernel command line: BOOT_IMAGE=(hd0,1)/vmlinuz-6.x root=/dev/sda3 ro console=tty
Unknown kernel command line parameters "BOOT_IMAGE=(hd0,1)/vmlinuz-6.x", will be passed to user space.
[chenhuacai@loongson.cn: use strstarts()]
Link: https://lkml.kernel.org/r/20250815090120.1569947-1-chenhuacai@loongson.cn
Link: https://lkml.kernel.org/r/20250721101343.3283480-1-chenhuacai@loongson.cn
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Steven Rostedt reported a crash with "ftrace=function" kernel command
line:
[ 0.159269] BUG: kernel NULL pointer dereference, address: 000000000000001c
[ 0.160254] #PF: supervisor read access in kernel mode
[ 0.160975] #PF: error_code(0x0000) - not-present page
[ 0.161697] PGD 0 P4D 0
[ 0.162055] Oops: Oops: 0000 [#1] SMP PTI
[ 0.162619] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc2-test-00006-g48d06e78b7cb-dirty #9 PREEMPT(undef)
[ 0.164141] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.165439] RIP: 0010:kmem_cache_alloc_noprof (mm/slub.c:4237)
[ 0.166186] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 e4 f0 48 83 ec 20 8b 05 c9 b6 7e 01 <44> 8b 77 1c 65 4c 8b 2d b5 ea 20 02 4c 89 6c 24 18 41 89 f5 21 f0
[ 0.168811] RSP: 0000:ffffffffb2e03b30 EFLAGS: 00010086
[ 0.169545] RAX: 0000000001fff33f RBX: 0000000000000000 RCX: 0000000000000000
[ 0.170544] RDX: 0000000000002800 RSI: 0000000000002800 RDI: 0000000000000000
[ 0.171554] RBP: ffffffffb2e03b80 R08: 0000000000000004 R09: ffffffffb2e03c90
[ 0.172549] R10: ffffffffb2e03c90 R11: 0000000000000000 R12: 0000000000000000
[ 0.173544] R13: ffffffffb2e03c90 R14: ffffffffb2e03c90 R15: 0000000000000001
[ 0.174542] FS: 0000000000000000(0000) GS:ffff9d2808114000(0000) knlGS:0000000000000000
[ 0.175684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.176486] CR2: 000000000000001c CR3: 000000007264c001 CR4: 00000000000200b0
[ 0.177483] Call Trace:
[ 0.177828] <TASK>
[ 0.178123] mas_alloc_nodes (lib/maple_tree.c:176 (discriminator 2) lib/maple_tree.c:1255 (discriminator 2))
[ 0.178692] mas_store_gfp (lib/maple_tree.c:5468)
[ 0.179223] execmem_cache_add_locked (mm/execmem.c:207)
[ 0.179870] execmem_alloc (mm/execmem.c:213 mm/execmem.c:313 mm/execmem.c:335 mm/execmem.c:475)
[ 0.180397] ? ftrace_caller (arch/x86/kernel/ftrace_64.S:169)
[ 0.180922] ? __pfx_ftrace_caller (arch/x86/kernel/ftrace_64.S:158)
[ 0.181517] execmem_alloc_rw (mm/execmem.c:487)
[ 0.182052] arch_ftrace_update_trampoline (arch/x86/kernel/ftrace.c:266 arch/x86/kernel/ftrace.c:344 arch/x86/kernel/ftrace.c:474)
[ 0.182778] ? ftrace_caller_op_ptr (arch/x86/kernel/ftrace_64.S:182)
[ 0.183388] ftrace_update_trampoline (kernel/trace/ftrace.c:7947)
[ 0.184024] __register_ftrace_function (kernel/trace/ftrace.c:368)
[ 0.184682] ftrace_startup (kernel/trace/ftrace.c:3048)
[ 0.185205] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210)
[ 0.185877] register_ftrace_function_nolock (kernel/trace/ftrace.c:8717)
[ 0.186595] register_ftrace_function (kernel/trace/ftrace.c:8745)
[ 0.187254] ? __pfx_function_trace_call (kernel/trace/trace_functions.c:210)
[ 0.187924] function_trace_init (kernel/trace/trace_functions.c:170)
[ 0.188499] tracing_set_tracer (kernel/trace/trace.c:5916 kernel/trace/trace.c:6349)
[ 0.189088] register_tracer (kernel/trace/trace.c:2391)
[ 0.189642] early_trace_init (kernel/trace/trace.c:11075 kernel/trace/trace.c:11149)
[ 0.190204] start_kernel (init/main.c:970)
[ 0.190732] x86_64_start_reservations (arch/x86/kernel/head64.c:307)
[ 0.191381] x86_64_start_kernel (??:?)
[ 0.191955] common_startup_64 (arch/x86/kernel/head_64.S:419)
[ 0.192534] </TASK>
[ 0.192839] Modules linked in:
[ 0.193267] CR2: 000000000000001c
[ 0.193730] ---[ end trace 0000000000000000 ]---
The crash happens because on x86 ftrace allocations from execmem require
maple tree to be initialized.
Move maple tree initialization that depends only on slab availability
earlier in boot so that it will happen right after mm_core_init().
Link: https://lkml.kernel.org/r/20250824130759.1732736-1-rppt@kernel.org
Fixes: 5d79c2be5081 ("x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Closes: https://lore.kernel.org/all/20250820184743.0302a8b5@gandalf.local.home/
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
|
|
Avoid silently ignoring the initramfs when the file specified in rdinit is
not usable. This prints an error that clearly explains the issue (file
was not found, vs initramfs was not found).
Link: https://lkml.kernel.org/r/20250707091411.1412681-1-lillian@star-ark.net
Signed-off-by: Lillian Berry <lillian@star-ark.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The EFI call in start_kernel() is guarded by #ifdef CONFIG_X86. Move
the thing to the arch_cpu_finalize_init() path on x86 and get rid of
the #ifdef in start_kernel().
No functional change intended.
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/all/20250620135325.3300848-5-kirill.shutemov%40linux.intel.com
|
|
After commit a934a57a42f64a4 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") and 7d95680d64ac8e836c ("scripts/misc-check:
check unnecessary #include <linux/export.h> when W=1"), we get some build
warnings with W=1:
init/main.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
init/initramfs.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
So fix these build warnings for the init code.
Link: https://lkml.kernel.org/r/20250608141235.155206-1-chenhuacai@loongson.cn
Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking tasks's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
|
Patch series "Use kmem_cache for memcg alloc", v3.
(willy tldr: "you've gone from allocating 8 objects per 32KiB to
allocating 13 objects per 32KiB, a 62% improvement in memory consumption"
[1])
The mem_cgroup_alloc function creates mem_cgroup struct and it's
associated structures including mem_cgroup_per_node. Through detailed
analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel, 1 NUMA node,
memcgv2 with nokmem,nosocket,cgroup_disable=pressure), we can observe the
memory allocation for these structures using the following shell commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup
# Trigger allocation if cgroup subtree do not enable memcg
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
Ftrace Output:
# mem_cgroup struct allocation
sh-6312 [000] ..... 58015.698365: kmalloc:
call_site=mem_cgroup_css_alloc+0xd8/0x5b4
ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
# mem_cgroup_per_node allocation
sh-6312 [000] ..... 58015.698389: kmalloc:
call_site=mem_cgroup_css_alloc+0x1d8/0x5b4
ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false
Key Observations:
1. Both structures use kmalloc with requested sizes between 2KB-4KB
2. Allocation alignment forces 4KB slab usage due to pre-defined sizes
(64B, 128B,..., 2KB, 4KB, 8KB)
3. Memory waste per memcg instance:
Base struct: 4096 - 2312 = 1784 bytes
Per-node struct: 4096 - 2896 = 1200 bytes
Total waste: 2984 bytes (1-node system)
NUMA scaling: (1200 + 8) * nr_node_ids bytes
So, it's a little waste.
This patchset introduces dedicated kmem_cache:
Patch2 - mem_cgroup kmem_cache - memcg_cachep
Patch3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep
The benefits of this change can be observed with the following tracing
commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup
# In another terminal:
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
The output might now look like this:
# mem_cgroup struct allocation
sh-9827 [000] ..... 289.513598: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806
bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9827 [000] ..... 289.513602: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a
bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
This indicates that the `mem_cgroup` struct now requests 2312 bytes and is
allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes and
is allocated 2944 bytes. The slight increase in allocated size is due to
`SLAB_HWCACHE_ALIGN` in the `kmem_cache`.
Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
# mem_cgroup struct allocation
sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
While the `bytes_alloc` now matches the `bytes_req`, this patchset
defaults to using `SLAB_HWCACHE_ALIGN` as it is generally considered more
beneficial for performance. Please let me know if there are any issues or
if I've misunderstood anything.
This patchset also move mem_cgroup_init ahead of cgroup_init() due to
cgroup_init() will allocate root_mem_cgroup, but each initcall invoke
after cgroup_init, so if each kmem_cache do not prepare, we need testing
NULL before use it.
This patch (of 3):
When cgroup_init() creates root_mem_cgroup through css_alloc callback,
some critical resources might not be fully initialized, forcing later
operations to perform conditional checks for resource availability.
This patch move mem_cgroup_init() to address the init order, it invoke
before cgroup_init, so, compare to subsys_initcall, it can use to prepare
some key resources before root_mem_cgroup alloc.
Link: https://lkml.kernel.org/r/aAsRCj-niMMTtmK8@casper.infradead.org [1]
Link: https://lkml.kernel.org/r/20250425031935.76411-1-link@vivo.com
Link: https://lkml.kernel.org/r/20250425031935.76411-2-link@vivo.com
Signed-off-by: Huan Yang <link@vivo.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Francesco Valla <francesco@valla.it>
Cc: guoweikang <guoweikang.kernel@gmail.com>
Cc: Huang Shijie <shijie@os.amperecomputing.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Raul E Rangel <rrangel@chromium.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When initcall_debug is specified on the command line, the start and return
point for each initcall is printed. However, no information on the
initcall level is reported.
Add to the initcall_debug infrastructure an additional print that informs
when a new initcall level is entered. This is particularly useful when
debugging dependency chains and/or working on boot time reduction.
Link: https://lkml.kernel.org/r/20250316205014.2830071-2-francesco@valla.it
Signed-off-by: Francesco Valla <francesco@valla.it>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move this API to the canonical timers_*() namespace.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-8-mingo@kernel.org
|
|
The timer and hrtimer softirq processing has moved to dedicated threads
for kernels built with CONFIG_IRQ_FORCED_THREADING=y. This results in
timers not expiring until later in early boot, which in turn causes the
RCU Tasks self-tests to hang in kernels built with CONFIG_PROVE_RCU=y,
which further causes the entire kernel to hang. One fix would be to
make timers work during this time, but there are no known users of RCU
Tasks grace periods during that time, so no justification for the added
complexity. Not yet, anyway.
This commit therefore moves the call to rcu_init_tasks_generic() from
kernel_init_freeable() to a core_initcall(). This works because the
timer and hrtimer kthreads are created at early_initcall() time.
Fixes: 49a17639508c3 ("softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <linux-trace-kernel@vger.kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
|
|
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory. In most cases, when memory allocation fails, an
immediate panic is required. To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`. This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.
[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a separate initialization of kvfree_rcu() functionality.
For such purpose a kfree_rcu_batch_init() is renamed to a kvfree_rcu_init()
and it is invoked from the main.c right after rcu_init() is done.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Today we are proxying the `console=` command line args to the
`param_setup_earlycon()` handler. This is done because the following are
equivalent:
console=uart[8250],mmio,<addr>[,options]
earlycon=uart[8250],mmio,<addr>[,options]
Both invocations enable an early `bootconsole`. `console=uartXXXX` is
just an alias for `earlycon=uartXXXX`.
In addition, when `earlycon=` (empty value) or just `earlycon`
(no value) is specified on the command line, we enable the earlycon
`bootconsole` specified by the SPCR table or the DT.
The problem arises when `console=` (empty value) is specified on the
command line. It's intention is to disable the `console`, but what
happens instead is that the SPRC/DT console gets enabled.
This happens because we are proxying the `console=` (empty value)
parameter to the `earlycon` handler. The `earlycon` handler then sees
that the parameter value is empty, so it enables the SPCR/DT
`bootconsole`.
This change makes it so that the `console` or `console=` parameters no
longer enable the SPCR/DT `bootconsole`. I also cleans up the hack in
`main.c` that would forward the `console` parameter to the `earlycon`
handler.
Signed-off-by: Raul E Rangel <rrangel@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240911123507.v2.1.Id08823b2f848237ae90ce5c5fa7e027e97c33ad3@changeid
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
With LSMs using static calls and static keys, early_lsm_init needs to
wait for setup_arch for architecture specific functionality which
includes jump tables and static calls to be initialized.
Since not all architectures call jump_table_init in setup_arch,
explicitly call both jump_table_init and static_call_init before
early_security_init.
This only affects "early LSMs" i.e. only lockdown when
CONFIG_SECURITY_LOCKDOWN_LSM_EARLY is set.
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-mm updates from Andrew Morton:
"Mainly singleton patches, documented in their respective changelogs.
Notable series include:
- Some maintenance and performance work for ocfs2 in Heming Zhao's
series "improve write IO performance when fragmentation is high".
- Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes
exposed by fstests".
- kfifo header rework from Andy Shevchenko in the series "kfifo:
Clean up kfifo.h".
- GDB script fixes from Florian Rommel in the series "scripts/gdb:
Fixes for $lx_current and $lx_per_cpu".
- After much discussion, a coding-style update from Barry Song
explaining one reason why inline functions are preferred over
macros. The series is "codingstyle: avoid unused parameters for a
function-like macro""
* tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits)
fs/proc: fix softlockup in __read_vmcore
nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON()
scripts: checkpatch: check unused parameters for function-like macro
Documentation: coding-style: ask function-like macros to evaluate parameters
nilfs2: use __field_struct() for a bitwise field
selftests/kcmp: remove unused open mode
nilfs2: remove calls to folio_set_error() and folio_clear_error()
kernel/watchdog_perf.c: tidy up kerneldoc
watchdog: allow nmi watchdog to use raw perf event
watchdog: handle comma separated nmi_watchdog command line
nilfs2: make superblock data array index computation sparse friendly
squashfs: remove calls to set the folio error flag
squashfs: convert squashfs_symlink_read_folio to use folio APIs
scripts/gdb: fix detection of current CPU in KGDB
scripts/gdb: make get_thread_info accept pointers
scripts/gdb: fix parameter handling in $lx_per_cpu
scripts/gdb: fix failing KGDB detection during probe
kfifo: don't use "proxy" headers
media: stih-cec: add missing io.h
media: rc: add missing io.h
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull bootconfig updates from Masami Hiramatsu:
- Do not put unneeded quotes on the extra command line items which was
inserted from the bootconfig.
- Remove redundant spaces from the extra command line.
* tag 'bootconfig-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
init/main.c: Minor cleanup for the setup_command_line() function
init/main.c: Remove redundant space from saved_command_line
bootconfig: do not put quotes on cmdline items unless necessary
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Over a dozen code generation micro-optimizations for the atomic
and spinlock code
- Add more __ro_after_init attributes
- Robustify the lockdevent_*() macros
* tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro
locking/qspinlock/x86: Micro-optimize virt_spin_lock()
locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu()
locking/atomic/x86: Introduce arch_try_cmpxchg64_local()
locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock()
locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h
locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending()
locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail()
locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions
locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions
locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32
locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32
locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64
locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}()
locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128()
x86/tsc: Make __use_tsc __ro_after_init
x86/kvm: Make kvm_async_pf_enabled __ro_after_init
context_tracking: Make context_tracking_key __ro_after_init
jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
locking/qspinlock: Always evaluate lockevent* non-event parameter once
|
|
During the kernel booting, the generic cpu_to_node() is called too early
in arm64, powerpc and riscv when CONFIG_NUMA is enabled.
There are at least four places in the common code where
the generic cpu_to_node() is called before it is initialized:
1.) early_trace_init() in kernel/trace/trace.c
2.) sched_init() in kernel/sched/core.c
3.) init_sched_fair_class() in kernel/sched/fair.c
4.) workqueue_init_early() in kernel/workqueue.c
This will harm performance since there is an increase in off node
accesses.
In order to fix the bug, the patch introduces early_numa_node_init() which
is called after smp_prepare_boot_cpu() in start_kernel.
early_numa_node_init will initialize the "numa_node" as soon as the
early_cpu_to_node() is ready, before the cpu_to_node() is called at the
first time.
Link: https://lkml.kernel.org/r/20240126064451.5465-1-shijie@os.amperecomputing.com
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [RISC-V]
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Kelley (LINUX) <mikelley@microsoft.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When trying to migrate to using bootconfig to embed the kernel's and
PID1's command line with the kernel image itself, and so allowing changing
that without modifying the bootloader, I noticed that /proc/cmdline
changed from e.g.
console=ttymxc0,115200n8 cma=128M quiet -- --log-level=notice
to
console="ttymxc0,115200n8" cma="128M" quiet -- --log-level="notice"
The kernel parameters are parsed just fine, and the quotes are indeed
stripped from the actual argv[] given to PID1. However, the quoting
doesn't really serve any purpose and looks excessive, and might confuse
some (naive) userspace tool trying to parse /proc/cmdline. So do not
quote the value unless it contains whitespace.
Link: https://lkml.kernel.org/r/20240320101952.62135-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is just a minor cleanup to make the code look a bit cleaner.
Link: https://lore.kernel.org/all/20240412081733.35925-3-ytcoode@gmail.com/
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
There is a space at the end of extra_init_args. In the current logic,
copying extra_init_args to saved_command_line will cause extra spaces
in saved_command_line here or there. Remove the trailing space from
extra_init_args to make the string in saved_command_line look more perfect.
Link: https://lore.kernel.org/all/20240412032950.12687-1-ytcoode@gmail.com/
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When trying to migrate to using bootconfig to embed the kernel's and
PID1's command line with the kernel image itself, and so allowing
changing that without modifying the bootloader, I noticed that
/proc/cmdline changed from e.g.
console=ttymxc0,115200n8 cma=128M quiet -- --log-level=notice
to
console="ttymxc0,115200n8" cma="128M" quiet -- --log-level="notice"
The kernel parameters are parsed just fine, and the quotes are indeed
stripped from the actual argv[] given to PID1. However, the quoting
doesn't really serve any purpose and looks excessive, and might
confuse some (naive) userspace tool trying to parse /proc/cmdline. So
do not quote the value unless it contains whitespace.
Link: https://lore.kernel.org/all/20240308124401.1702046-1-linux@rasmusvillemoes.dk/
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
We allocate memory of size 'xlen + strlen(boot_command_line) + 1' for
static_command_line, but the strings copied into static_command_line are
extra_command_line and command_line, rather than extra_command_line and
boot_command_line.
When strlen(command_line) > strlen(boot_command_line), static_command_line
will overflow.
This patch just recovers strlen(command_line) which was miss-consolidated
with strlen(boot_command_line) in the commit f5c7310ac73e ("init/main: add
checks for the return value of memblock_alloc*()")
Link: https://lore.kernel.org/all/20240412081733.35925-2-ytcoode@gmail.com/
Fixes: f5c7310ac73e ("init/main: add checks for the return value of memblock_alloc*()")
Cc: stable@vger.kernel.org
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
If the "bootconfig" kernel command-line argument was specified or if
the kernel was built with CONFIG_BOOT_CONFIG_FORCE, but if there are
no embedded kernel parameter, omit the "# Parameters from bootloader:"
comment from the /proc/bootconfig file. This will cause automation
to fall back to the /proc/cmdline file, which will be identical to the
comment in this no-embedded-kernel-parameters case.
Link: https://lore.kernel.org/all/20240409044358.1156477-2-paulmck@kernel.org/
Fixes: 8b8ce6c75430 ("fs/proc: remove redundant comments from /proc/bootconfig")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.
Therefore, mark these keys such that jump_label_add_module() might
recognise them and avoid the modification.
Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.
jump_label_add_module() does not exist under CONFIG_JUMP_LABEL=n, so the
newly-introduced jump_label_init_ro() can be defined as a nop for that
configuration.
[ mingo: Renamed jump_label_ro() to jump_label_init_ro() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-2-vschneid@redhat.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
- Kuan-Wei Chiu has also sped up the library sorting code in the series
"lib/sort: Optimize the number of swaps and comparisons".
- Alexey Gladkov has added the ability for code running within an IPC
namespace to alter its IPC and MQ limits. The series is "Allow to
change ipc/mq sysctls inside ipc namespace".
- Geert Uytterhoeven has contributed some dhrystone maintenance work in
the series "lib: dhry: miscellaneous cleanups".
- Ryusuke Konishi continues nilfs2 maintenance work in the series
"nilfs2: eliminate kmap and kmap_atomic calls"
"nilfs2: fix kernel bug at submit_bh_wbc()"
- Nathan Chancellor has updated our build tools requirements in the
series "Bump the minimum supported version of LLVM to 13.0.1".
- Muhammad Usama Anjum continues with the selftests maintenance work in
the series "selftests/mm: Improve run_vmtests.sh".
- Oleg Nesterov has done some maintenance work against the signal code
in the series "get_signal: minor cleanups and fix".
Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.
* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
nilfs2: prevent kernel bug at submit_bh_wbc()
nilfs2: fix failure to detect DAT corruption in btree and direct mappings
ocfs2: enable ocfs2_listxattr for special files
ocfs2: remove SLAB_MEM_SPREAD flag usage
assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
buildid: use kmap_local_page()
watchdog/core: remove sysctl handlers from public header
nilfs2: use div64_ul() instead of do_div()
mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
kexec: copy only happens before uchunk goes to zero
get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
get_signal: don't abuse ksig->info.si_signo and ksig->sig
const_structs.checkpatch: add device_type
Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
list: leverage list_is_head() for list_entry_is_head()
nilfs2: MAINTAINERS: drop unreachable project mirror site
smp: make __smp_processor_id() 0-argument macro
fat: fix uninitialized field in nostale filehandles
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull modules updates from Luis Chamberlain:
"Christophe Leroy did most of the work on this release, first with a
few cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error
handling for when set_memory_XX() can fail.
This is part of a larger effort to clean up all these callers which
can fail, modules is just part of it"
* tag 'modules-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: Don't ignore errors from set_memory_XX()
lib/test_kmod: fix kernel-doc warnings
powerpc: Simplify strict_kernel_rwx_enabled()
modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled
init: Declare rodata_enabled and mark_rodata_ro() at all time
module: Change module_enable_{nx/x/ro}() to more explicit names
module: Use set_memory_rox()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, including a large series from Thomas Gleixner to cure
sparse warnings"
* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Drop unused declaration of proc_nmi_enabled()
x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
x86/percpu: Cure per CPU madness on UP
smp: Consolidate smp_prepare_boot_cpu()
x86/msr: Add missing __percpu annotations
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
perf/x86/amd/uncore: Fix __percpu annotation
x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
x86/apm_32: Remove dead function apm_get_battery_status()
x86/insn-eval: Fix function param name in get_eff_addr_sib()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
"Rework of APIC enumeration and topology evaluation.
The current implementation has a couple of shortcomings:
- It fails to handle hybrid systems correctly.
- The APIC registration code which handles CPU number assignents is
in the middle of the APIC code and detached from the topology
evaluation.
- The various mechanisms which enumerate APICs, ACPI, MPPARSE and
guest specific ones, tweak global variables as they see fit or in
case of XENPV just hack around the generic mechanisms completely.
- The CPUID topology evaluation code is sprinkled all over the vendor
code and reevaluates global variables on every hotplug operation.
- There is no way to analyze topology on the boot CPU before bringing
up the APs. This causes problems for infrastructure like PERF which
needs to size certain aspects upfront or could be simplified if
that would be possible.
- The APIC admission and CPU number association logic is
incomprehensible and overly complex and needs to be kept around
after boot instead of completing this right after the APIC
enumeration.
This update addresses these shortcomings with the following changes:
- Rework the CPUID evaluation code so it is common for all vendors
and provides information about the APIC ID segments in a uniform
way independent of the number of segments (Thread, Core, Module,
..., Die, Package) so that this information can be computed instead
of rewriting global variables of dubious value over and over.
- A few cleanups and simplifcations of the APIC, IO/APIC and related
interfaces to prepare for the topology evaluation changes.
- Seperation of the parser stages so the early evaluation which tries
to find the APIC address can be seperately overridden from the late
evaluation which enumerates and registers the local APIC as further
preparation for sanitizing the topology evaluation.
- A new registration and admission logic which
- encapsulates the inner workings so that parsers and guest logic
cannot longer fiddle in it
- uses the APIC ID segments to build topology bitmaps at
registration time
- provides a sane admission logic
- allows to detect the crash kernel case, where CPU0 does not run
on the real BSP, automatically. This is required to prevent
sending INIT/SIPI sequences to the real BSP which would reset
the whole machine. This was so far handled by a tedious command
line parameter, which does not even work in nested crash
scenarios.
- Associates CPU number after the enumeration completed and
prevents the late registration of APICs, which was somehow
tolerated before.
- Converting all parsers and guest enumeration mechanisms over to the
new interfaces.
This allows to get rid of all global variable tweaking from the
parsers and enumeration mechanisms and sanitizes the XEN[PV]
handling so it can use CPUID evaluation for the first time.
- Mopping up existing sins by taking the information from the APIC ID
segment bitmaps.
This evaluates hybrid systems correctly on the boot CPU and allows
for cleanups and fixes in the related drivers, e.g. PERF.
The series has been extensively tested and the minimal late fallout
due to a broken ACPI/MADT table has been addressed by tightening the
admission logic further"
* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
x86/topology: Ignore non-present APIC IDs in a present package
x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
smp: Provide 'setup_max_cpus' definition on UP too
smp: Avoid 'setup_max_cpus' namespace collision/shadowing
x86/bugs: Use fixed addressing for VERW operand
x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
x86/cpu/topology: Provide __num_[cores|threads]_per_package
x86/cpu/topology: Rename topology_max_die_per_package()
x86/cpu/topology: Rename smp_num_siblings
x86/cpu/topology: Retrieve cores per package from topology bitmaps
x86/cpu/topology: Use topology logical mapping mechanism
x86/cpu/topology: Provide logical pkg/die mapping
x86/cpu/topology: Simplify cpu_mark_primary_thread()
x86/cpu/topology: Mop up primary thread mask handling
x86/cpu/topology: Use topology bitmaps for sizing
x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
x86/xen/smp_pv: Count number of vCPUs early
x86/cpu/topology: Assign hotpluggable CPUIDs during init
x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
x86/topology: Add a mechanism to track topology via APIC IDs
...
|
|
Pull workqueue updates from Tejun Heo:
"This cycle, a lot of workqueue changes including some that are
significant and invasive.
- During v6.6 cycle, unbound workqueues were updated so that they are
more topology aware and flexible, which among other things improved
workqueue behavior on modern multi-L3 CPUs. In the process, commit
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") switched unbound workqueues to use per-CPU
frontend pool_workqueues as a part of increasing front-back mapping
flexibility.
An unwelcome side effect of this change was that this made max
concurrency enforcement per-CPU blowing up the maximum number of
allowed concurrent executions. I incorrectly assumed that this
wouldn't cause practical problems as most unbound workqueue users
are self-regulate max concurrency; however, there definitely are
which don't (e.g. on IO paths) and the drastic increase in the
allowed max concurrency led to noticeable perf regressions in some
use cases.
This is now addressed by separating out max concurrency enforcement
to a separate struct - wq_node_nr_active - which makes @max_active
consistently mean system-wide max concurrency regardless of the
number of CPUs or (finally) NUMA nodes. This is a rather invasive
and, in places, a bit clunky; however, the clunkiness rises from
the the inherent requirement to handle the disagreement between the
execution locality domain and max concurrency enforcement domain on
some modern machines.
See commit 5797b1c18919 ("workqueue: Implement system-wide
nr_active enforcement for unbound workqueues") for more details.
- BH workqueue support is added.
They are similar to per-CPU workqueues but execute work items in
the softirq context. This is expected to replace tasklet. However,
currently, it's missing the ability to disable and enable work
items which is needed to convert many tasklet users. To avoid
crowding this merge window too much, this will be included in the
next merge window. A separate pull request will be sent for the
couple conversion patches that are currently pending.
- Waiman plugged a long-standing hole in workqueue CPU isolation
where ordered workqueues didn't follow wq_unbound_cpumask updates.
Ordered workqueues now follow the same rules as other unbound
workqueues.
- More CPU isolation improvements: Juri fixed another deficit in
workqueue isolation where unbound rescuers don't respect
wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on
isolated CPUs.
- Other misc changes"
* tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits)
workqueue: Drain BH work items on hot-unplugged CPUs
workqueue: Introduce from_work() helper for cleaner callback declarations
workqueue: Control intensive warning threshold through cmdline
workqueue: Make @flags handling consistent across set_work_data() and friends
workqueue: Remove clear_work_data()
workqueue: Factor out work_grab_pending() from __cancel_work_sync()
workqueue: Clean up enum work_bits and related constants
workqueue: Introduce work_cancel_flags
workqueue: Use variable name irq_flags for saving local irq flags
workqueue: Reorganize flush and cancel[_sync] functions
workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
workqueue: Cosmetic changes
workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
workqueue: Fix queue_work_on() with BH workqueues
async: Use a dedicated unbound workqueue with raised min_active
workqueue: Implement workqueue_set_min_active()
workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
kernel/workqueue: Let rescuers follow unbound wq cpumask changes
...
|
|
The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking. It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.
Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier(). To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).
Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.
Even worse, the rcu_barrier() can introduce significant delay. Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.
[ 0.291444] Freeing unused kernel memory: 5568K
[ 0.402442] Run /sbin/init as init process
With this fix, the above delay can be eliminated.
Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is no point in having seven architectures implementing the same empty
stub.
Provide a weak function in the init code and remove the stubs.
This also allows to utilize the function on UP which is required to
sanitize the per CPU handling on X86 UP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.567671691@linutronix.de
|
|
This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:
* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
file->private_data. This means it is now possible to introduce
concepts that operate on a process once all file descriptors have been
closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
pid will refer to the same inode if it's opened multiple times. In
contrast to now where each struct pid refers to the same inode. Even
if we were to move to anon_inode_create_getfile() which creates new
inodes we'd still be associating the same struct pid with multiple
different inodes.
The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.
We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.
The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.
Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This was already defined locally by init/main.c, but let's make
it generic, as arch/x86/kernel/cpu/topology.c is going to make
use of it to have more uniform code.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Since commit 3570ee046c46b5dc ("s390/smp: keep the original lowcore for
CPU 0"), there is no longer any architecture that needs to override
arch_call_rest_init().
Remove the weak wrapper around rest_init(), call rest_init() directly, and
make rest_init() static.
Link: https://lkml.kernel.org/r/aa10868bfb176eef4abb8bb4a710b85330792694.1706106183.git.geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(), which
is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
no-op otherwise.
Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
debug_checkwx() immediately after calling mark_rodata_ro() instead of
calling it at the end of every mark_rodata_ro().
On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
before calling debug_checkwx(). Now the check is inside the callee
ptdump_walk_pgd_level_checkwx().
On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check
is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().
Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Phong Tran <tranmanphong@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Price <steven.price@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Async can schedule a number of interdependent work items. However, since
5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for
unbound workqueues"), unbound workqueues have separate min_active which sets
the number of interdependent work items that can be handled. This default
value is 8 which isn't sufficient for async and can lead to stalls during
resume from suspend in some cases.
Let's use a dedicated unbound workqueue with raised min_active.
Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Declaring rodata_enabled and mark_rodata_ro() at all time
helps removing related #ifdefery in C files.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
The code that checks for unknown boot options is unaware of the sysctl
alias facility, which maps bootparams to sysctl values. If a user sets
an old value that has a valid alias, a message about an invalid
parameter will be printed during boot, and the parameter will get passed
to init. Fix by checking for the existence of aliased parameters in the
unknown boot parameter code. If an alias exists, don't return an error
or pass the value to init.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: 0a477e1ae21b ("kernel/sysctl: support handling command line aliases")
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.
Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().
As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.
This patch changes the initialization sequence but the end result should be
the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-mm updates from Andrew Morton:
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
directories
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs
- Zhen Lei has enhanced kexec's ability to support two crash regions
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries
- And the usual bunch of singleton patches in various places
* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
kernel/time/posix-stubs.c: remove duplicated include
ocfs2: remove redundant assignment to variable bit_off
watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
devres: show which resource was invalid in __devm_ioremap_resource()
watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
watchdog/hardlockup: make the config checks more straightforward
watchdog/hardlockup: sort hardlockup detector related config values a logical way
watchdog/hardlockup: move SMP barriers from common code to buddy code
watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molar:
"Build footprint & performance improvements:
- Reduce memory usage with CONFIG_DEBUG_INFO=y
In the worst case of an allyesconfig+CONFIG_DEBUG_INFO=y kernel,
DWARF creates almost 200 million relocations, ballooning objtool's
peak heap usage to 53GB. These patches reduce that to 25GB.
On a distro-type kernel with kernel IBT enabled, they reduce
objtool's peak heap usage from 4.2GB to 2.8GB.
These changes also improve the runtime significantly.
Debuggability improvements:
- Add the unwind_debug command-line option, for more extend unwinding
debugging output
- Limit unreachable warnings to once per function
- Add verbose option for disassembling affected functions
- Include backtrace in verbose mode
- Detect missing __noreturn annotations
- Ignore exc_double_fault() __noreturn warnings
- Remove superfluous global_noreturns entries
- Move noreturn function list to separate file
- Add __kunit_abort() to noreturns
Unwinder improvements:
- Allow stack operations in UNWIND_HINT_UNDEFINED regions
- drm/vmwgfx: Add unwind hints around RBP clobber
Cleanups:
- Move the x86 entry thunk restore code into thunk functions
- x86/unwind/orc: Use swap() instead of open coding it
- Remove unnecessary/unused variables
Fixes for modern stack canary handling"
* tag 'objtool-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
x86/orc: Make the is_callthunk() definition depend on CONFIG_BPF_JIT=y
objtool: Skip reading DWARF section data
objtool: Free insns when done
objtool: Get rid of reloc->rel[a]
objtool: Shrink elf hash nodes
objtool: Shrink reloc->sym_reloc_entry
objtool: Get rid of reloc->jump_table_start
objtool: Get rid of reloc->addend
objtool: Get rid of reloc->type
objtool: Get rid of reloc->offset
objtool: Get rid of reloc->idx
objtool: Get rid of reloc->list
objtool: Allocate relocs in advance for new rela sections
objtool: Add for_each_reloc()
objtool: Don't free memory in elf_close()
objtool: Keep GElf_Rel[a] structs synced
objtool: Add elf_create_section_pair()
objtool: Add mark_sec_changed()
objtool: Fix reloc_hash size
objtool: Consolidate rel/rela handling
...
|
|
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and
remove the weak fallback from the core code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224545.670360645@linutronix.de
|
|
X86 is reworking the boot process so that initializations which are not
required during early boot can be moved into the late boot process and out
of the fragile and restricted initial boot phase.
arch_cpu_finalize_init() is the obvious place to do such initializations,
but arch_cpu_finalize_init() is invoked too late in start_kernel() e.g. for
initializing the FPU completely. fork_init() requires that the FPU is
initialized as the size of task_struct on X86 depends on the size of the
required FPU register buffer.
Fortunately none of the init calls between calibrate_delay() and
arch_cpu_finalize_init() is relevant for the functionality of
arch_cpu_finalize_init().
Invoke it right after calibrate_delay() where everything which is relevant
for arch_cpu_finalize_init() has been set up already.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20230613224545.612182854@linutronix.de
|
|
Everything is converted over to arch_cpu_finalize_init(). Remove the
check_bugs() leftovers including the empty stubs in asm-generic, alpha,
parisc, powerpc and xtensa.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Link: https://lore.kernel.org/r/20230613224545.553215951@linutronix.de
|
|
check_bugs() has become a dumping ground for all sorts of activities to
finalize the CPU initialization before running the rest of the init code.
Most are empty, a few do actual bug checks, some do alternative patching
and some cobble a CPU advertisement string together....
Aside of that the current implementation requires duplicated function
declaration and mostly empty header files for them.
Provide a new function arch_cpu_finalize_init(). Provide a generic
declaration if CONFIG_ARCH_HAS_CPU_FINALIZE_INIT is selected and a stub
inline otherwise.
This requires a temporary #ifdef in start_kernel() which will be removed
along with check_bugs() once the architectures are converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230613224544.957805717@linutronix.de
|
|
The init/main.c file contains some extern declarations for functions
defined in architecture code, and it defines some other functions that are
called from architecture code with a custom prototype. Both of those
result in warnings with 'make W=1':
init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes]
init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes]
arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes]
arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes]
arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes]
init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes]
Add prototypes for all of these in include/linux/init.h or another
appropriate header, and remove the duplicate declarations from
architecture specific code.
[sfr@canb.auug.org.au: declare time_init_early()]
Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au
Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
prevent_tail_call_optimization() was added in
commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try")
to work around stack canaries getting inserted into functions that would
initialize the stack canary in the first place.
Now that we have no_stack_protector function attribute (gcc-11+,
clang-7+) and use it on start_kernel(), remove the call to
prevent_tail_call_optimization() such that we may one day remove it
outright.
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230412-no_stackp-v2-2-116f9fe4bbe7@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
Back during the discussion of
commit a9a3ed1eff36 ("x86: Fix early boot crash on gcc-10, third try")
we discussed the need for a function attribute to control the omission
of stack protectors on a per-function basis; at the time Clang had
support for no_stack_protector but GCC did not. This was fixed in
gcc-11. Now that the function attribute is available, let's start using
it.
Callers of boot_init_stack_canary need to use this function attribute
unless they're compiled with -fno-stack-protector, otherwise the canary
stored in the stack slot of the caller will differ upon the call to
boot_init_stack_canary. This will lead to a call to __stack_chk_fail()
then panic.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722
Link: https://lore.kernel.org/all/20200316130414.GC12561@hirez.programming.kicks-ass.net/
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230412-no_stackp-v2-1-116f9fe4bbe7@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: ndesaulniers@google.com <ndesaulniers@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures &
drivers that did this inconsistently follow this new, common
convention, and fix all the fallout that objtool can now detect
statically
- Fix/improve the ORC unwinder becoming unreliable due to
UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
and UNWIND_HINT_UNDEFINED to resolve it
- Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code
- Generate ORC data for __pfx code
- Add more __noreturn annotations to various kernel startup/shutdown
and panic functions
- Misc improvements & fixes
* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/hyperv: Mark hv_ghcb_terminate() as noreturn
scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
btrfs: Mark btrfs_assertfail() __noreturn
objtool: Include weak functions in global_noreturns check
cpu: Mark nmi_panic_self_stop() __noreturn
cpu: Mark panic_smp_self_stop() __noreturn
arm64/cpu: Mark cpu_park_loop() and friends __noreturn
x86/head: Mark *_start_kernel() __noreturn
init: Mark start_kernel() __noreturn
init: Mark [arch_call_]rest_init() __noreturn
objtool: Generate ORC data for __pfx code
x86/linkage: Fix padding for typed functions
objtool: Separate prefix code from stack validation code
objtool: Remove superfluous dead_end_function() check
objtool: Add symbol iteration helpers
objtool: Add WARN_INSN()
scripts/objdump-func: Support multiple functions
context_tracking: Fix KCSAN noinstr violation
objtool: Add stackleak instrumentation to uaccess safe list
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the necessary glue so that the kernel can run as a confidential
SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the
address space in two parts: encrypted and unencrypted. The use case
being running unmodified guests on the Hyper-V confidential computing
hypervisor
- Double-buffer messages between the guest and the hardware PSP device
so that no partial buffers are copied back'n'forth and thus potential
message integrity and leak attacks are possible
- Name the return value the sev-guest driver returns when the hw PSP
device hasn't been called, explicitly
- Cleanups
* tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyperv: Change vTOM handling to use standard coco mechanisms
init: Call mem_encrypt_init() after Hyper-V hypercall init is done
x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
x86/hyperv: Reorder code to facilitate future work
x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM
x86/sev: Change snp_guest_issue_request()'s fw_err argument
virt/coco/sev-guest: Double-buffer messages
crypto: ccp: Get rid of __sev_platform_init_locked()'s local function pointer
crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALL
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull user work thread updates from Christian Brauner:
"This contains the work generalizing the ability to create a kernel
worker from a userspace process.
Such user workers will run with the same credentials as the userspace
process they were created from providing stronger security and
accounting guarantees than the traditional override_creds() approach
ever could've hoped for.
The original work was heavily based and optimzed for the needs of
io_uring which was the first user. However, as it quickly turned out
the ability to create user workers inherting properties from a
userspace process is generally useful.
The vhost subsystem currently creates workers using the kthread api.
The consequences of using the kthread api are that RLIMITs don't work
correctly as they are inherited from khtreadd. This leads to bugs
where more workers are created than would be allowed by the RLIMITs of
the userspace process in lieu of which workers are created.
Problems like this disappear with user workers created from the
userspace processes for which they perform the work. In addition,
providing this api allows vhost to remove additional complexity. For
example, cgroup and mm sharing will just work out of the box with user
workers based on the relevant userspace process instead of manually
ensuring the correct cgroup and mm contexts are used.
So the vhost subsystem should simply be made to use the same mechanism
as io_uring. To this end the original mechanism used for
create_io_thread() is generalized into user workers:
- Introduce PF_USER_WORKER as a generic indicator that a given task
is a user worker, i.e., a kernel task that was created from a
userspace process. Now a PF_IO_WORKER thread is just a specialized
version of PF_USER_WORKER. So io_uring io workers raise both flags.
- Make copy_process() available to core kernel code
- Extend struct kernel_clone_args with the following bitfields
allowing to indicate to copy_process():
- to create a user worker (raise PF_USER_WORKER)
- to not inherit any files from the userspace process
- to ignore signals
After all generic changes are in place the vhost subsystem implements
a new dedicated vhost api based on user workers. Finally, vhost is
switched to rely on the new api moving it off of kthreads.
Thanks to Mike for sticking it out and making it through this rather
arduous journey"
* tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
vhost: use vhost_tasks for worker threads
vhost: move worker thread fields to new struct
vhost_task: Allow vhost layer to use copy_process
fork: allow kernel code to call copy_process
fork: Add kernel_clone_args flag to ignore signals
fork: add kernel_clone_args flag to not dup/clone files
fork/vm: Move common PF_IO_WORKER behavior to new flag
kernel: Make io_thread and kthread bit fields
kthread: Pass in the thread's name during creation
kernel: Allow a kernel thread's name to be set in copy_process
csky: Remove kernel_thread declaration
|
|
Now that arch_call_rest_init() is __noreturn, mark its caller
start_kernel() __noreturn.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/7069acf026a195f26a88061227fba5a3b0337b9a.1681342859.git.jpoimboe@kernel.org
|
|
In preparation for improving objtool's handling of weak noreturn
functions, mark start_kernel(), arch_call_rest_init(), and rest_init()
__noreturn.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/7194ed8a989a85b98d92e62df660f4a90435a723.1681342859.git.jpoimboe@kernel.org
|
|
When deferred initialization of struct pages is enabled, page_ext_init()
must be called after all the deferred initialization is done, but there is
no point to keep it a separate call from kernel_init_freeable() right
after page_alloc_init_late().
Fold the call to page_ext_init() into page_alloc_init_late() and localize
deferred_struct_pages variable.
Link: https://lkml.kernel.org/r/20230321170513.2401534-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Make mm_init() a part of mm/ codebase. mm_core_init() better describes
what the function does and does not clash with mm_init() in kernel/fork.c
Link: https://lkml.kernel.org/r/20230321170513.2401534-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both build_all_zonelists() and page_alloc_init_cpuhp() must be called
after SMP setup is complete but before the page allocator is set up.
Still, they both are a part of memory management initialization, so move
them to mm_init().
Link: https://lkml.kernel.org/r/20230321170513.2401534-7-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page_alloc_init() name is really misleading because all this function
does is sets up CPU hotplug callbacks for the page allocator.
Rename it to page_alloc_init_cpuhp() so that name will reflect what the
function does.
Link: https://lkml.kernel.org/r/20230321170513.2401534-6-rppt@kernel.org
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Full Hyper-V initialization, including support for hypercalls, is done
as an apic_post_init callback via late_time_init(). mem_encrypt_init()
needs to make hypercalls when it marks swiotlb memory as decrypted.
But mem_encrypt_init() is currently called a few lines before
late_time_init(), so the hypercalls don't work.
Fix this by moving mem_encrypt_init() after late_time_init() and
related clock initializations. The intervening initializations don't
do any I/O that requires the swiotlb, so moving mem_encrypt_init()
slightly later has no impact.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1678329614-3482-6-git-send-email-mikelley@microsoft.com
|
|
Change no bootconfig data error message if user do not specify 'bootconfig'
option but CONFIG_BOOT_CONFIG_FORCE=y.
With CONFIG_BOOT_CONFIG_FORCE=y, the kernel proceeds bootconfig check even
if user does not specify 'bootconfig' option. So the current error message
is confusing. Let's show just an information message to notice skipping
the bootconfig in that case.
Link: https://lore.kernel.org/all/167754610254.318944.16848412476667893329.stgit@devnote2/
Fixes: b743852ccc1d ("Allow forcing unconditional bootconfig processing")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/all/CAMuHMdV9jJvE2y8gY5V_CxidUikCf5515QMZHzTA3rRGEOj6=w@mail.gmail.com/
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
|
|
This patch allows kernel users to pass in the thread name so it can be
set during creation instead of having to use set_task_comm after the
thread is created.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
|
|
The BOOT_CONFIG family of Kconfig options allows a bootconfig file
containing kernel boot parameters to be embedded into an initrd or into
the kernel itself. This can be extremely useful when deploying kernels
in cases where some of the boot parameters depend on the kernel version
rather than on the server hardware, firmware, or workload.
Unfortunately, the "bootconfig" kernel parameter must be specified in
order to cause the kernel to look for the embedded bootconfig file,
and it clearly does not help to embed this "bootconfig" kernel parameter
into that file.
Therefore, provide a new BOOT_CONFIG_FORCE Kconfig option that causes the
kernel to act as if the "bootconfig" kernel parameter had been specified.
In other words, kernels built with CONFIG_BOOT_CONFIG_FORCE=y will look
for the embedded bootconfig file even when the "bootconfig" kernel
parameter is omitted. This permits kernel-version-dependent kernel
boot parameters to be embedded into the kernel image without the need to
(for example) update large numbers of boot loaders.
Link: https://lore.kernel.org/all/20230105005838.GA1772817@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <linux-doc@vger.kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.
Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.
[pasha.tatashin@soleen.com: fix build with CONFIG_PAGE_EXTENSION=n]
Link: https://lkml.kernel.org/r/20230118155251.2522985-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20230117204617.1553748-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
"New Feature:
- Randomize the per-cpu entry areas
Cleanups:
- Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it
- Move to "native" set_memory_rox() helper
- Clean up pmd_get_atomic() and i386-PAE
- Remove some unused page table size macros"
* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/mm: Ensure forced page table splitting
x86/kasan: Populate shadow for shared chunk of the CPU entry area
x86/kasan: Add helpers to align shadow addresses up and down
x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
x86/mm: Recompute physical address for every page of per-CPU CEA mapping
x86/mm: Rename __change_page_attr_set_clr(.checkalias)
x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
x86/mm: Add a few comments
x86/mm: Fix CR3_ADDR_MASK
x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
mm: Convert __HAVE_ARCH_P..P_GET to the new style
mm: Remove pointless barrier() after pmdp_get_lockless()
x86/mm/pae: Get rid of set_64bit()
x86_64: Remove pointless set_64bit() usage
x86/mm/pae: Be consistent with pXXp_get_and_clear()
x86/mm/pae: Use WRITE_ONCE()
x86/mm/pae: Don't (ab)use atomic64
mm/gup: Fix the lockless PMD access
...
|
|
Move poking_init() up a bunch; specifically move it right after
mm_init() which is right before ftrace_init().
This will allow simplifying ftrace text poking which currently has
a bunch of exceptions for early boot.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201057.881703081@infradead.org
|
|
In order to allow using mm_alloc() much earlier, move initializing
mm_cachep into mm_init().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201057.751153381@infradead.org
|
|
Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.
cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:
let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();
will result in
openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)
open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.
In case of /proc/cmdline, the length is known precisely.
Make variables readonly while I'm at it.
P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.
Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- hfs and hfsplus kmap API modernization (Fabio Francesco)
- make crash-kexec work properly when invoked from an NMI-time panic
(Valentin Schneider)
- ntfs bugfixes (Hawkins Jiawei)
- improve IPC msg scalability by replacing atomic_t's with percpu
counters (Jiebin Sun)
- nilfs2 cleanups (Minghao Chi)
- lots of other single patches all over the tree!
* tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype
proc: test how it holds up with mapping'less process
mailmap: update Frank Rowand email address
ia64: mca: use strscpy() is more robust and safer
init/Kconfig: fix unmet direct dependencies
ia64: update config files
nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure
fork: remove duplicate included header files
init/main.c: remove unnecessary (void*) conversions
proc: mark more files as permanent
nilfs2: remove the unneeded result variable
nilfs2: delete unnecessary checks before brelse()
checkpatch: warn for non-standard fixes tag style
usr/gen_init_cpio.c: remove unnecessary -1 values from int file
ipc/msg: mitigate the lock contention with percpu counter
percpu: add percpu_counter_add_local and percpu_counter_sub_local
fs/ocfs2: fix repeated words in comments
relay: use kvcalloc to alloc page array in relay_alloc_page_array
proc: make config PROC_CHILDREN depend on PROC_FS
fs: uninline inode_maybe_inc_iversion()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
|
|
The void pointer object can be directly assigned to different structure
objects, it does not need to be cast.
Link: https://lkml.kernel.org/r/20220928014539.11046-1-zhoujie@nfschina.com
Signed-off-by: Zhou jie <zhoujie@nfschina.com>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kmsan_init_shadow() scans the mappings created at boot time and creates
metadata pages for those mappings.
When the memblock allocator returns pages to pagealloc, we reserve 2/3 of
those pages and use them as metadata for the remaining 1/3. Once KMSAN
starts, every page allocated by pagealloc has its associated shadow and
origin pages.
kmsan_initialize() initializes the bookkeeping for init_task and enables
KMSAN.
Link: https://lkml.kernel.org/r/20220915150417.722975-18-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The full RNG initialization relies on some timestamps, made possible
with initialization functions like time_init() and timekeeping_init().
However, these are only available rather late in initialization.
Meanwhile, other things, such as memory allocator functions, make use of
the RNG much earlier.
So split RNG initialization into two phases. We can provide arch
randomness very early on, and then later, after timekeeping and such are
available, initialize the rest.
This ensures that, for example, slabs are properly randomized if RDRAND
is available. Without this, CONFIG_SLAB_FREELIST_RANDOM=y loses a degree
of its security, because its random seed is potentially deterministic,
since it hasn't yet incorporated RDRAND. It also makes it possible to
use a better seed in kfence, which currently relies on only the cycle
counter.
Another positive consequence is that on systems with RDRAND, running
with CONFIG_WARN_ALL_UNSEEDED_RANDOM=y results in no warnings at all.
One subtle side effect of this change is that on systems with no RDRAND,
RDTSC is now only queried by random_init() once, committing the moment
of the function call, instead of multiple times as before. This is
intentional, as the multiple RDTSCs in a loop before weren't
accomplishing very much, with jitter being better provided by
try_to_generate_entropy(). Plus, filling blocks with RDTSC is still
being done in extract_entropy(), which is necessarily called before
random bytes are served anyway.
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
There is additional BUG_ON() calls added within the tree, most of which
are in debug code. These will be replaced with a WARN_ON() call in the
future. There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date. These exist to catch
things such as out-of-range accesses which would crash anyways.
Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220818210200.8203-1-wsa+renesas@sang-engineering.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid some panic problem. It seems that we cannot track early page
allocations in current kernel even if page structure has been initialized
early.
This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above. It can help us to catch early page allocations. This is
useful especially when we find that the free memory value is not the same
right after different kernel booting.
[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On arm64, "rodata=full" has been suppored (but not documented) since
commit:
c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well")
As it's necessary to determine the rodata configuration early during
boot, arm64 has an early_param() handler for this, whereas init/main.c
has a __setup() handler which is run later.
Unfortunately, this split meant that since commit:
f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions")
... passing "rodata=full" would result in a spurious warning from the
__setup() handler (though RO permissions would be configured
appropriately).
Further, "rodata=full" has been broken since commit:
0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
... which caused strtobool() to parse "full" as false (in addition to
many other values not documented for the "rodata=" kernel parameter.
This patch fixes this breakage by:
* Moving the core parameter parser to an __early_param(), such that it
is available early.
* Adding an (optional) arch hook which arm64 can use to parse "full".
* Updating the documentation to mention that "full" is valid for arm64.
* Having the core parameter parser handle "on" and "off" explicitly,
such that any undocumented values (e.g. typos such as "ful") are
reported as errors rather than being silently accepted.
Note that __setup() and early_param() have opposite conventions for
their return values, where __setup() uses 1 to indicate a parameter was
handled and early_param() uses 0 to indicate a parameter was handled.
Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions")
Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jagdish Gediya <jvgediya@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Fix the following Sparse warnings that got noticed when the PPC-dev
patchwork was checking another patch (see the link below):
init/main.c:862:1: warning: symbol 'randomize_kstack_offset' was not declared. Should it be static?
init/main.c:864:1: warning: symbol 'kstack_offset' was not declared. Should it be static?
Which in fact are triggered on all architectures that have
HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET support (for instances x86, arm64
etc).
Link: https://lore.kernel.org/lkml/e7b0d68b-914d-7283-827c-101988923929@huawei.com/T/#m49b2d4490121445ce4bf7653500aba59eefcb67f
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220629060423.2515693-1-gongruiqi1@huawei.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
"This updates init and user mode helper tasks to be ordinary user mode
tasks.
Commit 40966e316f86 ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turned made a use after free of
struct kthread possible.
Here, commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple
enough to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensuere the closing of all it's file descriptors
was complete, and flush_delayed_fput does not work in a userspace
thread.
I have looked and looked and more complications and in my code review
I have not found any, and neither has anyone else with the code
sitting in linux-next"
* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched: Update task_tick_numa to ignore tasks without an mm
fork: Stop allowing kthreads to call execve
fork: Explicitly set PF_KTHREAD
init: Deal with the init process being a user mode process
fork: Generalize PF_IO_WORKER handling
fork: Explicity test for idle tasks in copy_thread
fork: Pass struct kernel_clone_args into copy_thread
kthread: Don't allocate kthread_struct for init and umh
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The majority of the changes are for fixes and clean ups.
Notable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to
having it embedded in initram). This is useful for embedded boards
without initram disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use
more than 59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset> instead of using the
name of the function before it"
* tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits)
ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function
tracing: Fix comments for event_trigger_separate_filter()
x86/traceponit: Fix comment about irq vector tracepoints
x86,tracing: Remove unused headers
ftrace: Clean up hash direct_functions on register failures
tracing: Fix comments of create_filter()
tracing: Disable kcov on trace_preemptirq.c
tracing: Initialize integer variable to prevent garbage return value
ftrace: Fix typo in comment
ftrace: Remove return value of ftrace_arch_modify_*()
tracing: Cleanup code by removing init "char *name"
tracing: Change "char *" string form to "char []"
tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ
tracing/timerlat: Print stacktrace in the IRQ handler if needed
tracing/timerlat: Notify IRQ new max latency only if stop tracing is set
kprobes: Fix build errors with CONFIG_KRETPROBES=n
tracing: Fix return value of trace_pid_write()
tracing: Fix potential double free in create_var_ref()
tracing: Use strim() to remove whitespace instead of doing it manually
ftrace: Deal with error return code of the ftrace_process_locs() function
...
|
|
Currently, start_kernel() adds latent entropy and the command line to
the entropy bool *after* the RNG has been initialized, deferring when
it's actually used by things like stack canaries until the next time
the pool is seeded. This surely is not intended.
Rather than splitting up which entropy gets added where and when between
start_kernel() and random_init(), just do everything in random_init(),
which should eliminate these kinds of bugs in the future.
While we're at it, rename the awkwardly titled "rand_initialize()" to
the more standard "random_init()" nomenclature.
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Currently time_init() is called after rand_initialize(), but
rand_initialize() makes use of the timer on various platforms, and
sometimes this timer needs to be initialized by time_init() first. In
order for random_get_entropy() to not return zero during early boot when
it's potentially used as an entropy source, reverse the order of these
two calls. The block doing random initialization was right before
time_init() before, so changing the order shouldn't have any complicated
effects.
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If kthread_is_per_cpu runs concurrently with free_kthread_struct the
kthread_struct that was just freed may be read from.
This bug was introduced by commit 40966e316f86 ("kthread: Ensure
struct kthread is present for all kthreads"). When kthread_struct
started to be allocated for all tasks that have PF_KTHREAD set. This
in turn required the kthread_struct to be freed in kernel_execve and
violated the assumption that kthread_struct will have the same
lifetime as the task.
Looking a bit deeper this only applies to callers of kernel_execve
which is just the init process and the user mode helper processes.
These processes really don't want to be kernel threads but are for
historical reasons. Mostly that copy_thread does not know how to take
a kernel mode function to the process with for processes without
PF_KTHREAD or PF_IO_WORKER set.
Solve this by not allocating kthread_struct for the init process and
the user mode helper processes.
This is done by adding a kthread member to struct kernel_clone_args.
Setting kthread in fork_idle and kernel_thread. Adding
user_mode_thread that works like kernel_thread except it does not set
kthread. In fork only allocating the kthread_struct if .kthread is set.
I have looked at kernel/kthread.c and since commit 40966e316f86
("kthread: Ensure struct kthread is present for all kthreads") there
have been no assumptions added that to_kthread or __to_kthread will
not return NULL.
There are a few callers of to_kthread or __to_kthread that assume a
non-NULL struct kthread pointer will be returned. These functions are
kthread_data(), kthread_parmme(), kthread_exit(), kthread(),
kthread_park(), kthread_unpark(), kthread_stop(). All of those functions
can reasonably expected to be called when it is know that a task is a
kthread so that assumption seems reasonable.
Cc: stable@vger.kernel.org
Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
Reported-by: Максим Кутявин <maximkabox13@gmail.com>
Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
This allows kernel developer to embed a default bootconfig file in
the kernel instead of embedding it in the initrd. This will be good
for who are using the kernel without initrd, or who needs a default
bootconfigs.
This needs to set two kconfigs: CONFIG_BOOT_CONFIG_EMBED=y and set
the file path to CONFIG_BOOT_CONFIG_EMBED_FILE.
Note that you still need 'bootconfig' command line option to load the
embedded bootconfig. Also if you boot using an initrd with a different
bootconfig, the kernel will use the bootconfig in the initrd, instead
of the default bootconfig.
Link: https://lkml.kernel.org/r/164921227943.1090670.14035119557571329218.stgit@devnote2
Cc: Padmanabha Srinivasaiah <treasure4paddy@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Check the bootconfig's checksum before removing the bootconfig data
from initrd to avoid modifying initrd by mistake.
This will also simplifies the get_boot_config_from_initrd() interface.
Link: https://lkml.kernel.org/r/164921226891.1090670.16955839243639298134.stgit@devnote2
Cc: Padmanabha Srinivasaiah <treasure4paddy@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Linux Kbuild mailing list <linux-kbuild@vger.kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Merge more updates from Andrew Morton:
"Various misc subsystems, before getting into the post-linux-next
material.
41 patches.
Subsystems affected by this patch series: procfs, misc, core-kernel,
lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
taskstats, panic, kcov, resource, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
kernel/resource: fix kfree() of bootmem memory again
kcov: properly handle subsequent mmap calls
kcov: split ioctl handling into locked and unlocked parts
panic: move panic_print before kmsg dumpers
panic: add option to dump all CPUs backtraces in panic_print
docs: sysctl/kernel: add missing bit to panic_print
taskstats: remove unneeded dead assignment
kasan: no need to unset panic_on_warn in end_report()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
panic: unset panic_on_warn inside panic()
docs: kdump: add scp example to write out the dump file
docs: kdump: update description about sysfs file system support
arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
cgroup: use irqsave in cgroup_rstat_flush_locked().
fat: use pointer to simple type in put_user()
minix: fix bug when opening a file with O_DIRECT
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"The sprinkling of SPI drivers is because we added a new one and Mark
sent us a SPI driver interface conversion pull request.
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout the
stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower iTLB
pressure and prevents identity mapping huge pages from getting
split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop the
user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if
called from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling of
TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardward Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
llc: fix netdevice reference leaks in llc_ui_bind()
drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
ice: don't allow to run ice_send_event_to_aux() in atomic ctx
ice: fix 'scheduling while atomic' on aux critical err interrupt
net/sched: fix incorrect vlan_push_eth dest field
net: bridge: mst: Restrict info size queries to bridge ports
net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
drivers: net: xgene: Fix regression in CRC stripping
net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
net: dsa: fix missing host-filtered multicast addresses
net/mlx5e: Fix build warning, detected write beyond size of field
iwlwifi: mvm: Don't fail if PPAG isn't supported
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
netdevice: add missing dm_private kdoc
net: bridge: mst: prevent NULL deref in br_mst_info_size()
selftests: forwarding: Use same VRF for port and VLAN upper
...
|
|
initcall_blacklist() should return 1 to indicate that it handled its
cmdline arguments.
set_debug_rodata() should return 1 to indicate that it handled its
cmdline arguments. Print a warning if the option string is invalid.
This prevents these strings from being added to the 'init' program's
environment as they are not init arguments/parameters.
Link: https://lkml.kernel.org/r/20220221050901.23985-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use ktime_us_delta() to make the initcall_debug log more precise than
right shifting the result of ktime_to_ns() by 10 bits.
Link: https://lkml.kernel.org/r/20220209053350.15771-1-mark-pk.tsai@mediatek.com
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Tested-by: Andrew Halaney <ahalaney@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The randomize_kstack_offset feature is unconditionally compiled in when
the architecture supports it.
To add constraints on compiler versions, we require a dedicated Kconfig
variable. Therefore, introduce RANDOMIZE_KSTACK_OFFSET.
Furthermore, this option is now also configurable by EXPERT kernels:
while the feature is supposed to have zero performance overhead when
disabled, due to its use of static branches, there are few cases where
giving a distribution the option to disable the feature entirely makes
sense. For example, in very resource constrained environments, which
would never enable the feature to begin with, in which case the
additional kernel code size increase would be redundant.
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220131090521.1947110-1-elver@google.com
|
|
While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.
This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);
This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&init_net)
We had too many prior patches [1] trying to work around the root cause.
Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.
Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()
As a bonus, this patch saves around 4KB in .data section.
[1]
f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df8019a ("net: Initialise init_net.count to 1")
734b65417b24 ("net: Statically initialize init_net.dev_base_head")
v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, enabling CONFIG_STACKDEPOT means its stack_table will be
allocated from memblock, even if stack depot ends up not actually used.
The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit.
This is fine for use-cases such as KASAN which is also a config option
and has overhead on its own. But it's an issue for functionality that
has to be actually enabled on boot (page_owner) or depends on hardware
(GPU drivers) and thus the memory might be wasted. This was raised as
an issue [1] when attempting to add stackdepot support for SLUB's debug
object tracking functionality. It's common to build kernels with
CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or
create only specific kmem caches with debugging for testing purposes.
It would thus be more efficient if stackdepot's table was allocated only
when actually going to be used. This patch thus makes the allocation
(and whole stack_depot_init() call) optional:
- Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current
well-defined point of allocation as part of mem_init(). Make
CONFIG_KASAN select this flag.
- Other users have to call stack_depot_init() as part of their own init
when it's determined that stack depot will actually be used. This may
depend on both config and runtime conditions. Convert current users
which are page_owner and several in the DRM subsystem. Same will be
done for SLUB later.
- Because the init might now be called after the boot-time memblock
allocation has given all memory to the buddy allocator, change
stack_depot_init() to allocate stack_table with kvmalloc() when
memblock is no longer available. Also handle allocation failure by
disabling stackdepot (could have theoretically happened even with
memblock allocation previously), and don't unnecessarily align the
memblock allocation to its own size anymore.
[1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Marco Elver <elver@google.com> # stackdepot
Cc: Marco Elver <elver@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Imran Khan <imran.f.khan@oracle.com>
From: Colin Ian King <colin.king@canonical.com>
Subject: lib/stackdepot: fix spelling mistake and grammar in pr_err message
There is a spelling mistake of the work allocation so fix this and
re-phrase the message to make it easier to read.
Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
From: Vlastimil Babka <vbabka@suse.cz>
Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup
On FLATMEM, we call page_ext_init_flatmem_late() just before
kmem_cache_init() which means stack_depot_init() (called by page owner
init) will not recognize properly it should use kvmalloc() and not
memblock_alloc(). memblock_alloc() will also not issue a warning and
return a block memory that can be invalid and cause kernel page fault when
saving stacks, as reported by the kernel test robot [1].
Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so
that slab_is_available() is true during stack_depot_init(). SPARSEMEM
doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(),
but a different page_ext_init() even later in the boot process.
Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue.
While at it, also actually resolve a checkpatch warning in stack_depot_init()
from DRM CI, which was supposed to be in the original patch already.
[1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
From: Vlastimil Babka <vbabka@suse.cz>
Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3
Due to cd06ab2fd48f ("drm/locking: add backtrace for locking contended
locks without backoff") landing recently to -next adding a new stack depot
user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate
call to stack_depot_init() there as well.
Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Marco Elver <elver@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
From: Vlastimil Babka <vbabka@suse.cz>
Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup4
Due to 4e66934eaadc ("lib: add reference counting tracking
infrastructure") landing recently to net-next adding a new stack depot
user in lib/ref_tracker.c we need to add an appropriate call to
stack_depot_init() there as well.
Link: https://lkml.kernel.org/r/45c1b738-1a2f-5b5f-2f6d-86fab206d01c@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Slab <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
|
|
The prior message is confusing users, which is the exact opposite of the
goal. If the message is being seen, one of the following situations is
happening:
1. the param is misspelled
2. the param is not valid due to the kernel configuration
3. the param is intended for init but isn't after the '--'
delineator on the command line
To make that more clear to the user, explicitly mention "kernel command
line" and also note that the params are still passed to user space to
avoid causing any alarm over params intended for init.
Link: https://lkml.kernel.org/r/20211013223502.96756-1-ahalaney@redhat.com
Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
|
|
Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer so that memblock_free() will be a
counterpart of memblock_alloc()
The callers are updated with the below semantic patch and manual
addition of (void *) casting to pointers that are represented by
unsigned long variables.
@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)
[sfr@canb.auug.org.au: fixup]
Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
core_kernel_text() considers that until system_state in at least
SYSTEM_RUNNING, init memory is valid.
But init memory is freed a few lines before setting SYSTEM_RUNNING, so
we have a small period of time when core_kernel_text() is wrong.
Create an intermediate system state called SYSTEM_FREEING_INIT that is
set before starting freeing init memory, and use it in
core_kernel_text() to report init memory invalid earlier.
Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
|
|
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Free unused memblock in a error case to fix memblock leak
in xbc_make_cmdline().
Link: https://lkml.kernel.org/r/163177339181.682366.8713781325929549256.stgit@devnote2
Fixes: 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig for kernel command line")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Avoid using this noisy name and use more calm one.
This is just a name change. No functional change.
Link: https://lkml.kernel.org/r/163187295918.2366983.5231840238429996027.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add xbc_get_info() API which allows user to get the
number of used xbc_nodes and the size of bootconfig
data. This is also useful for checking the bootconfig
is initialized or not.
Link: https://lkml.kernel.org/r/163177340877.682366.4360676589783197627.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Allocate 'xbc_data' in the xbc_init() so that it does
not need to care about the ownership of the copied
data.
Link: https://lkml.kernel.org/r/163177339986.682366.898762699429769117.stgit@devnote2
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Commit f8ade8dddb16 ("xsurf100: drop include of lib8390.c") accidentally
changed init/main.c. Revert that part.
Fixes: f8ade8dddb16 ("xsurf100: drop include of lib8390.c")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The boot-time allocation interface for memblock is a mess, with
'memblock_alloc()' returning a virtual pointer, but then you are
supposed to free it with 'memblock_free()' that takes a _physical_
address.
Not only is that all kinds of strange and illogical, but it actually
causes bugs, when people then use it like a normal allocation function,
and it fails spectacularly on a NULL pointer:
https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
or just random memory corruption if the debug checks don't catch it:
https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
I really don't want to apply patches that treat the symptoms, when the
fundamental cause is this horribly confusing interface.
I started out looking at just automating a sane replacement sequence,
but because of this mix or virtual and physical addresses, and because
people have used the "__pa()" macro that can take either a regular
kernel pointer, or just the raw "unsigned long" address, it's all quite
messy.
So this just introduces a new saner interface for freeing a virtual
address that was allocated using 'memblock_alloc()', and that was kept
as a regular kernel pointer. And then it converts a couple of users
that are obvious and easy to test, including the 'xbc_nodes' case in
lib/bootconfig.c that caused problems.
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- Add migrate-disable counter to tracing header
- Fix error handling in event probes
- Fix missed unlock in osnoise in error path
- Fix merge issue with tools/bootconfig
- Clean up bootconfig data when init memory is removed
- Fix bootconfig to loop only on subkeys
- Have kernel command lines override bootconfig options
- Increase field counts for synthetic events
- Have histograms dynamic allocate event elements to save space
- Fixes in testing and documentation
* tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/boot: Fix to loop on only subkeys
selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events
tracing: Dynamically allocate the per-elt hist_elt_data array
tracing: synth events: increase max fields count
tools/bootconfig: Show whole test command for each test case
bootconfig: Fix missing return check of xbc_node_compose_key function
tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh
docs: bootconfig: Add how to use bootconfig for kernel parameters
init/bootconfig: Reorder init parameter from bootconfig and cmdline
init: bootconfig: Remove all bootconfig data when the init memory is removed
tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads()
tracing: Fix some alloc_event_probe() error handling bugs
tracing: Add migrate-disabled counter to tracing output.
|
|
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
|
|
Reorder the init parameters from bootconfig and kernel cmdline
so that the kernel cmdline always be the last part of the
parameters as below.
" -- "[bootconfig init params][cmdline init params]
This change will help us to prevent that bootconfig init params
overwrite the init params which user gives in the command line.
Link: https://lkml.kernel.org/r/163077085675.222577.5665176468023636160.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Since the bootconfig is used only in the init functions,
it doesn't need to keep the data after boot. Free it when
the init memory is removed.
Link: https://lkml.kernel.org/r/163077084958.222577.5924961258513004428.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
There are some empty trap_init() definitions in different ARCHs, Introduce
a new weak trap_init() function to clean them up.
Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm32]
Acked-by: Vineet Gupta [arc]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <palmerdabbelt@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, usermodehelper is enabled right before PID1 starts going
through the initcalls. However, any call of a usermodehelper from a
pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as
there is no filesystem contents yet.
Up until commit e7cb072eb988 ("init/initramfs.c: do unpacking
asynchronously"), such calls, whether via some request_module(), a
legacy uevent "/sbin/hotplug" notification or something else, would
just fail silently with (presumably) -ENOENT from
kernel_execve(). However, that commit introduced the
wait_for_initramfs() synchronization hook which must be called from
the usermodehelper exec path right before the kernel_execve, in order
that request_module() et al done from *after* rootfs_initcall()
time (i.e. device_ and late_ initcalls) would continue to find a
populated initramfs as they used to.
Any call of wait_for_initramfs() done before the unpacking has been
scheduled (i.e. before rootfs_initcall time) must just return
immediately [and let the caller find an empty file system] in order
not to deadlock the machine. I mistakenly thought, and my limited
testing confirmed, that there were no such calls, so I added a
pr_warn_once() in wait_for_initramfs(). It turns out that one can
indeed hit request_module() as well as kobject_uevent_env() during
those early init calls, leading to a user-visible warning in the
kernel log emitted consistently for certain configurations.
We could just remove the pr_warn_once(), but I think it's better to
postpone enabling the usermodehelper framework until there is at least
some chance of finding the executable. That is also a little more
efficient in that a lot of work done in umh.c will be elided. However,
it does change the error seen by those early callers from -ENOENT to
-EBUSY, so there is a risk of a regression if any caller care about
the exact error value.
Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk
Fixes: e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
drivers/ptp/Kconfig:
55c8fca1dae1 ("ptp_pch: Restore dependency on PCI")
e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since the 'bootconfig' command line parameter is handled before
parsing the command line, it doesn't use early_param(). But in
this case, kernel shows a wrong warning message about it.
[ 0.013714] Kernel command line: ro console=ttyS0 bootconfig console=tty0
[ 0.013741] Unknown command line parameters: bootconfig
To suppress this message, add a dummy handler for 'bootconfig'.
Link: https://lkml.kernel.org/r/162812945097.77369.1849780946468010448.stgit@devnote2
Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Now that ax88796.c exports the ax_NS8390_reinit() symbol, we can
include 8390.h instead of lib8390.c, avoiding duplication of that
function and killing a few compile warnings in the bargain.
Fixes: 861928f4e60e826c ("net-next: New ax88796 platform
driver for Amiga X-Surf 100 Zorro board (m68k)")
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Parse the kernel's build ID at initialization so that other code can print
a hex format string representation of the running kernel's build ID. This
will be used in the kdump and dump_stack code so that developers can
easily locate the vmlinux debug symbols for a crash/stacktrace.
[swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()]
Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com
Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Bitmap parsing support for "all" as an alias for all bits
- Documentation updates
- Miscellaneous fixes, including some that overlap into mm and lockdep
- kvfree_rcu() updates
- mem_dump_obj() updates, with acks from one of the slab-allocator
maintainers
- RCU NOCB CPU updates, including limited deoffloading
- SRCU updates
- Tasks-RCU updates
- Torture-test updates
* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
rcu: Add missing __releases() annotation
rcu: Remove obsolete rcu_read_unlock() deadlock commentary
rcu: Improve comments describing RCU read-side critical sections
rcu: Create an unrcu_pointer() to remove __rcu from a pointer
srcu: Early test SRCU polling start
rcu: Fix various typos in comments
rcu/nocb: Unify timers
rcu/nocb: Prepare for fine-grained deferred wakeup
rcu/nocb: Only cancel nocb timer if not polling
rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
rcu/nocb: Allow de-offloading rdp leader
rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
rcu: Don't penalize priority boosting when there is nothing to boost
rcu: Point to documentation of ordering guarantees
rcu: Make rcu_gp_cleanup() be noinline for tracing
rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.
- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
|
|
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
|
|
It is easy to foobar setting a kernel parameter on the command line
without realizing it, there's not much output that you can use to assess
what the kernel did with that parameter by default.
Make it a little more explicit which parameters on the command line
_looked_ like a valid parameter for the kernel, but did not match anything
and ultimately got tossed to init. This is very similar to the unknown
parameter message received when loading a module.
This assumes the parameters are processed in a normal fashion, some
parameters (dyndbg= for example) don't register their parameter with the
rest of the kernel's parameters, and therefore always show up in this list
(and are also given to init - like the rest of this list).
Another example is BOOT_IMAGE= is highlighted as an offender, which it
technically is, but is passed by LILO and GRUB so most systems will see
that complaint.
An example output where "foobared" and "unrecognized" are intentionally
invalid parameters:
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo
Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Move the checksum calculation function into the header for sharing it
with tools/bootconfig.
Link: https://lkml.kernel.org/r/162262197470.264090.16325743685807878807.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
During boot, kernel_init_freeable() initializes `cad_pid` to the init
task's struct pid. Later on, we may change `cad_pid` via a sysctl, and
when this happens proc_do_cad_pid() will increment the refcount on the
new pid via get_pid(), and will decrement the refcount on the old pid
via put_pid(). As we never called get_pid() when we initialized
`cad_pid`, we decrement a reference we never incremented, can therefore
free the init task's struct pid early. As there can be dangling
references to the struct pid, we can later encounter a use-after-free
(e.g. when delivering signals).
This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
have been around since the conversion of `cad_pid` to struct pid in
commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the
pre-KASAN stone age of v2.6.19.
Fix this by getting a reference to the init task's struct pid when we
assign it to `cad_pid`.
Full KASAN splat below.
==================================================================
BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273
CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
ns_of_pid include/linux/pid.h:153 [inline]
task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
do_notify_parent+0x308/0xe60 kernel/signal.c:1950
exit_notify kernel/exit.c:682 [inline]
do_exit+0x2334/0x2bd0 kernel/exit.c:845
do_group_exit+0x108/0x2c8 kernel/exit.c:922
get_signal+0x4e4/0x2a88 kernel/signal.c:2781
do_signal arch/arm64/kernel/signal.c:882 [inline]
do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
work_pending+0xc/0x2dc
Allocated by task 0:
slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
slab_alloc_node mm/slub.c:2907 [inline]
slab_alloc mm/slub.c:2915 [inline]
kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
alloc_pid+0xdc/0xc00 kernel/pid.c:180
copy_process+0x2794/0x5e18 kernel/fork.c:2129
kernel_clone+0x194/0x13c8 kernel/fork.c:2500
kernel_thread+0xd4/0x110 kernel/fork.c:2552
rest_init+0x44/0x4a0 init/main.c:687
arch_call_rest_init+0x1c/0x28
start_kernel+0x520/0x554 init/main.c:1064
0x0
Freed by task 270:
slab_free_hook mm/slub.c:1562 [inline]
slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
slab_free mm/slub.c:3161 [inline]
kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
put_pid+0x30/0x48 kernel/pid.c:109
proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x3ac/0x510 fs/read_write.c:518
vfs_write fs/read_write.c:605 [inline]
vfs_write+0x9c4/0x1018 fs/read_write.c:585
ksys_write+0x124/0x240 fs/read_write.c:658
__do_sys_write fs/read_write.c:670 [inline]
__se_sys_write fs/read_write.c:667 [inline]
__arm64_sys_write+0x78/0xb0 fs/read_write.c:667
__invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701
The buggy address belongs to the object at ffff23794dda0000
which belongs to the cache pid of size 224
The buggy address is located 4 bytes inside of
224-byte region [ffff23794dda0000, ffff23794dda00e0)
The buggy address belongs to the page:
page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
head:(____ptrval____) order:1 compound_mapcount:0
flags: 0x3fffc0000010200(slab|head)
raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
==================================================================
Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially")
to cover the new PF_NO_SETAFFINITY requirement.
While there, move wait_for_completion(&kthreadd_done) into kernel_init()
to make it absolutely clear it is the very first thing done by the init
thread.
Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net
|
|
As pointed out by commit
de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")
init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.
As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().
Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().
Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().
Secondary startups were patched via coccinelle:
@begone@
@@
-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
|
|
Once srcu_init() is called, the SRCU core will make use of delayed
workqueues, which rely on timers. However init_timers() is called
several steps after rcu_init(). This means that a call_srcu() after
rcu_init() but before init_timers() would find itself within a dangerously
uninitialized timer core.
This commit therefore creates a separate call to srcu_init() after
init_timer() completes, which ensures that we stay in early SRCU mode
until timers are safe(r).
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Merge yet more updates from Andrew Morton:
"This is everything else from -mm for this merge window.
90 patches.
Subsystems affected by this patch series: mm (cleanups and slub),
alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat,
checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov,
panic, delayacct, gdb, resource, selftests, async, initramfs, ipc,
drivers/char, and spelling"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits)
mm: fix typos in comments
mm: fix typos in comments
treewide: remove editor modelines and cruft
ipc/sem.c: spelling fix
fs: fat: fix spelling typo of values
kernel/sys.c: fix typo
kernel/up.c: fix typo
kernel/user_namespace.c: fix typos
kernel/umh.c: fix some spelling mistakes
include/linux/pgtable.h: few spelling fixes
mm/slab.c: fix spelling mistake "disired" -> "desired"
scripts/spelling.txt: add "overflw"
scripts/spelling.txt: Add "diabled" typo
scripts/spelling.txt: add "overlfow"
arm: print alloc free paths for address in registers
mm/vmalloc: remove vwrite()
mm: remove xlate_dev_kmem_ptr()
drivers/char: remove /dev/kmem for good
mm: fix some typos and code style problems
ipc/sem.c: mundane typo fixes
...
|
|
Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.
These two patches are independent, but better-together.
The second is a rather trivial patch that simply allows the developer to
change "/sbin/modprobe" to something else - e.g. the empty string, so
that all request_module() during early boot return -ENOENT early, without
even spawning a usermode helper, needlessly synchronizing with the
initramfs unpacking.
The first patch delegates decompressing the initramfs to a worker thread,
allowing do_initcalls() in main.c to proceed to the device_ and late_
initcalls without waiting for that decompression (and populating of
rootfs) to finish. Obviously, some of those later calls may rely on the
initramfs being available, so I've added synchronization points in the
firmware loader and usermodehelper paths - there might be other places
that would need this, but so far no one has been able to think of any
places I have missed.
There's not much to win if most of the functionality needed during boot is
only available as modules. But systems with a custom-made .config and
initramfs can boot faster, partly due to utilizing more than one cpu
earlier, partly by avoiding known-futile modprobe calls (which would still
trigger synchronization with the initramfs unpacking, thus eliminating
most of the first benefit).
This patch (of 2):
Most of the boot process doesn't actually need anything from the
initramfs, until of course PID1 is to be executed. So instead of doing
the decompressing and populating of the initramfs synchronously in
populate_rootfs() itself, push that off to a worker thread.
This is primarily motivated by an embedded ppc target, where unpacking
even the rather modest sized initramfs takes 0.6 seconds, which is long
enough that the external watchdog becomes unhappy that it doesn't get
attention soon enough. By doing the initramfs decompression in a worker
thread, we get to do the device_initcalls and hence start petting the
watchdog much sooner.
Normal desktops might benefit as well. On my mostly stock Ubuntu kernel,
my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
That takes almost two seconds:
[ 0.201454] Trying to unpack rootfs image as initramfs...
[ 1.976633] Freeing initrd memory: 29416K
Before this patch, these lines occur consecutively in dmesg. With this
patch, the timestamps on these two lines is roughly the same as above, but
with 172 lines inbetween - so more than one cpu has been kept busy doing
work that would otherwise only happen after the populate_rootfs()
finished.
Should one of the initcalls done after rootfs_initcall time (i.e., device_
and late_ initcalls) need something from the initramfs (say, a kernel
module or a firmware blob), it will simply wait for the initramfs
unpacking to be done before proceeding, which should in theory make this
completely safe.
But if some driver pokes around in the filesystem directly and not via one
of the official kernel interfaces (i.e. request_firmware*(),
call_usermodehelper*) that theory may not hold - also, I certainly might
have missed a spot when sprinkling wait_for_initramfs(). So there is an
escape hatch in the form of an initramfs_async= command line parameter.
Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New feature:
- A new "func-no-repeats" option in tracefs/options directory.
When set the function tracer will detect if the current function
being traced is the same as the previous one, and instead of
recording it, it will keep track of the number of times that the
function is repeated in a row. And when another function is
recorded, it will write a new event that shows the function that
repeated, the number of times it repeated and the time stamp of
when the last repeated function occurred.
Enhancements:
- In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no
longer needs to waste ring buffer space.
- New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
- No longer limit the PIDs of tasks that are recorded for
"saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
for a much larger range. This caused the mapping of PIDs to the
task names to be dropped for all tasks with a PID greater than
32768.
- Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
- Typos, prototype fixes, and removing of duplicate or unused code.
- Better management of ftrace_page allocations"
* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
ftrace: Reuse the output of the function tracer for func_repeats
tracing: Add "func_no_repeats" option for function tracing
tracing: Unify the logic for function tracing options
tracing: Add method for recording "func_repeats" events
tracing: Add "last_func_repeats" to struct trace_array
tracing: Define new ftrace event "func_repeats"
tracing: Define static void trace_print_time()
ftrace: Simplify the calculation of page number for ftrace_page->records some more
ftrace: Store the order of pages allocated in ftrace_page
tracing: Remove unused argument from "ring_buffer_time_stamp()
tracing: Remove duplicate struct declaration in trace_events.h
tracing: Update create_system_filter() kernel-doc comment
tracing: A minor cleanup for create_system_filter()
kernel: trace: Mundane typo fixes in the file trace_events_filter.c
tracing: Fix various typos in comments
scripts/recordmcount.pl: Make vim and emacs indent the same
scripts/recordmcount.pl: Make indent spacing consistent
tracing: Add a verifier to check string pointers for trace events
...
|
|
mem_init_print_info() is called in mem_init() on each architecture, and
pass NULL argument, so using void argument and move it into mm_init().
Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> [powerpc]
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com> [sparc64]
Acked-by: Russell King <rmk+kernel@armlinux.org.uk> [arm]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.
This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.
This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This provides the ability for architectures to enable kernel stack base
address offset randomization. This feature is controlled by the boot
param "randomize_kstack_offset=on/off", with its default value set by
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
This feature is based on the original idea from the last public release
of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
All the credit for the original idea goes to the PaX team. Note that
the design and implementation of this upstream randomize_kstack_offset
feature differs greatly from the RANDKSTACK feature (see below).
Reasoning for the feature:
This feature aims to make harder the various stack-based attacks that
rely on deterministic stack structure. We have had many such attacks in
past (just to name few):
https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
As Linux kernel stack protections have been constantly improving
(vmap-based stack allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have had to find new ways for their exploits
to work. They have done so, continuing to rely on the kernel's stack
determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
were not relevant. For example, the following recent attacks would have
been hampered if the stack offset was non-deterministic between syscalls:
https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
(page 70: targeting the pt_regs copy with linear stack overflow)
https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
(leaked stack address from one syscall as a target during next syscall)
The main idea is that since the stack offset is randomized on each system
call, it is harder for an attack to reliably land in any particular place
on the thread stack, even with address exposures, as the stack base will
change on the next syscall. Also, since randomization is performed after
placing pt_regs, the ptrace-based approach[1] to discover the randomized
offset during a long-running syscall should not be possible.
Design description:
During most of the kernel's execution, it runs on the "thread stack",
which is pretty deterministic in its structure: it is fixed in size,
and on every entry from userspace to kernel on a syscall the thread
stack starts construction from an address fetched from the per-cpu
cpu_current_top_of_stack variable. The first element to be pushed to the
thread stack is the pt_regs struct that stores all required CPU registers
and syscall parameters. Finally the specific syscall function is called,
with the stack being used as the kernel executes the resulting request.
The goal of randomize_kstack_offset feature is to add a random offset
after the pt_regs has been pushed to the stack and before the rest of the
thread stack is used during the syscall processing, and to change it every
time a process issues a syscall. The source of randomness is currently
architecture-defined (but x86 is using the low byte of rdtsc()). Future
improvements for different entropy sources is possible, but out of scope
for this patch. Further more, to add more unpredictability, new offsets
are chosen at the end of syscalls (the timing of which should be less
easy to measure from userspace than at syscall entry time), and stored
in a per-CPU variable, so that the life of the value does not stay
explicitly tied to a single task.
As suggested by Andy Lutomirski, the offset is added using alloca()
and an empty asm() statement with an output constraint, since it avoids
changes to assembly syscall entry code, to the unwinder, and provides
correct stack alignment as defined by the compiler.
In order to make this available by default with zero performance impact
for those that don't want it, it is boot-time selectable with static
branches. This way, if the overhead is not wanted, it can just be
left turned off with no performance impact.
The generated assembly for x86_64 with GCC looks like this:
...
ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax
# 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f add $0xf,%rax
ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax
ffffffff8100398c: 48 29 c4 sub %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax
...
As a result of the above stack alignment, this patch introduces about
5 bits of randomness after pt_regs is spilled to the thread stack on
x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for
stack alignment). The amount of entropy could be adjusted based on how
much of the stack space we wish to trade for security.
My measure of syscall performance overhead (on x86_64):
lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
randomize_kstack_offset=y Simple syscall: 0.7082 microseconds
randomize_kstack_offset=n Simple syscall: 0.7016 microseconds
So, roughly 0.9% overhead growth for a no-op syscall, which is very
manageable. And for people that don't want this, it's off by default.
There are two gotchas with using the alloca() trick. First,
compilers that have Stack Clash protection (-fstack-clash-protection)
enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
any dynamic stack allocations. While the randomization offset is
always less than a page, the resulting assembly would still contain
(unreachable!) probing routines, bloating the resulting assembly. To
avoid this, -fno-stack-clash-protection is unconditionally added to
the kernel Makefile since this is the only dynamic stack allocation in
the kernel (now that VLAs have been removed) and it is provably safe
from Stack Clash style attacks.
The second gotcha with alloca() is a negative interaction with
-fstack-protector*, in that it sees the alloca() as an array allocation,
which triggers the unconditional addition of the stack canary function
pre/post-amble which slows down syscalls regardless of the static
branch. In order to avoid adding this unneeded check and its associated
performance impact, architectures need to carefully remove uses of
-fstack-protector-strong (or -fstack-protector) in the compilation units
that use the add_random_kstack() macro and to audit the resulting stack
mitigation coverage (to make sure no desired coverage disappears). No
change is visible for this on x86 because the stack protector is already
unconditionally disabled for the compilation unit, but the change is
required on arm64. There is, unfortunately, no attribute that can be
used to disable stack protector for specific functions.
Comparison to PaX RANDKSTACK feature:
The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. including the location of pt_regs
structure itself on the stack. Initially this patch followed the same
approach, but during the recent discussions[2], it has been determined
to be of a little value since, if ptrace functionality is available for
an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
different offsets in the pt_regs struct, observe the cache behavior of
the pt_regs accesses, and figure out the random stack offset. Another
difference is that the random offset is stored in a per-cpu variable,
rather than having it be per-thread. As a result, these implementations
differ a fair bit in their implementation details and results, though
obviously the intent is similar.
[1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
[2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
[3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
|
|
Parameter "cmdline" has no use, drop it.
Link: https://lkml.kernel.org/r/20210311085213.27680-1-jojing64@gmail.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Cao jin <jojing64@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Currently breakpoints in kernel .init.text section are not handled
correctly while allowing to remove them even after corresponding pages
have been freed.
Fix it via killing .init.text section breakpoints just prior to initmem
pages being freed.
Doug: "HW breakpoints aren't handled by this patch but it's probably
not such a big deal".
Link: https://lkml.kernel.org/r/20210224081652.587785-1-sumit.garg@linaro.org
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Suggested-by: Doug Anderson <dianders@chromium.org>
Acked-by: Doug Anderson <dianders@chromium.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a kernel parameter stack_depot_disable to disable stack depot. So
that stack hash table doesn't consume any memory when stack depot is
disabled.
The use case is CONFIG_PAGE_OWNER without page_owner=on. Without this
patch, stackdepot will consume the memory for the hashtable. By default,
it's 8M which is never trivial.
With this option, in CONFIG_PAGE_OWNER configured system, page_owner=off,
stack_depot_disable in kernel command line, we could save the wasted
memory for the hashtable.
[akpm@linux-foundation.org: fix CONFIG_STACKDEPOT=n build]
Link: https://lkml.kernel.org/r/1611749198-24316-2-git-send-email-vjitta@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yogesh Lal <ylal@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors. This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
The KFENCE memory pool is of fixed size, and if the pool is exhausted no
further KFENCE allocations occur. The default config is conservative
with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
pages).
We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) and production server-workload benchmarks that a kernel with
KFENCE (using sample intervals 100-500ms) is performance-neutral
compared to a non-KFENCE baseline kernel.
KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].
For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:
https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence
This patch (of 9):
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error. To detect out-of-bounds
writes to memory within the object's page itself, KFENCE also uses
pattern-based redzones. The following figure illustrates the page
layout:
---+-----------+-----------+-----------+-----------+-----------+---
| xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
| xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
| x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
| xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
| xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
| xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
---+-----------+-----------+-----------+-----------+-----------+---
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.
For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).
[elver@google.com: fix parameter description for kfence_object_start()]
Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
[elver@google.com: avoid stalling work queue task without allocations]
Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
[elver@google.com: fix potential deadlock due to wake_up()]
Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
[elver@google.com: add option to use KFENCE without static keys]
Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
[elver@google.com: add missing copyright and description headers]
Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Co-developed-by: Marco Elver <elver@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SFI-based platforms are gone. So does this framework.
This removes mention of SFI through the drivers and other code as well.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
On ARCH=um, loading a module doesn't result in its constructors getting
called, which breaks module gcov since the debugfs files are never
registered. On the other hand, in-kernel constructors have already been
called by the dynamic linker, so we can't call them again.
Get out of this conundrum by allowing CONFIG_CONSTRUCTORS to be
selected, but avoiding the in-kernel constructor calls.
Also remove the "if !UML" from GCOV selecting CONSTRUCTORS now, since we
really do want CONSTRUCTORS, just not kernel binary ones.
Link: https://lkml.kernel.org/r/20210120172041.c246a2cac2fb.I1358f584b76f1898373adfed77f4462c8705b736@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 757055ae8dedf5333af17b3b5b4b70ba9bc9da4e.
The commit caused that ttynull was used as the default console
on several systems[1][2][3]. As a result, the console was
blank even when a better alternative existed.
It happened when there was no console configured
on the command line and ttynull_init() was the first initcall
calling register_console().
Or it happened when /dev/ did not exist when console_on_rootfs()
was called. It was not able to open /dev/console even though
a console driver was registered. It tried to add ttynull console
but it obviously did not help. But ttynull became the preferred
console and was used by /dev/console when it was available later.
The commit tried to fix a historical problem that have been there
for ages. The primary motivation was the commit 3cffa06aeef7ece30f6
("printk/console: Allow to disable console output by using console=""
or console=null"). It provided a clean solution for a workaround
that was widely used and worked only by chance.
This revert causes that the console="" or console=null command line
options will again work only by chance. These options will cause that
a particular console will be preferred and the default (tty) ones
will not get enabled. There will be no console registered at
all. As a result there won't be stdin, stdout, and stderr for
the init process. But it worked exactly this way even before.
The proper solution has to fulfill many conditions:
+ Register ttynull only when explicitly required or as
the ultimate fallback.
+ ttynull should get associated with /dev/console but it must
not become preferred console when used as a fallback.
Especially, it must still be possible to replace it
by a better console later.
Such a change requires clean up of the register_console() code.
Otherwise, it would be even harder to follow. Especially, the use
of has_preferred_console and CON_CONSDEV flag is tricky. The clean
up is risky. The ordering of consoles is not well defined. And
any changes tend to break existing user settings.
Do the revert at the least risky solution for now.
[1] https://lore.kernel.org/linux-kselftest/20201221144302.GR4077@smile.fi.intel.com/
[2] https://lore.kernel.org/lkml/d2a3b3c0-e548-7dd1-730f-59bc5c04e191@synopsys.com/
[3] https://patchwork.ozlabs.org/project/linux-um/patch/20210105120128.10854-1-thomas@m3y3r.de/
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Vineet Gupta <vgupta@synopsys.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This is a fix for a regression in the v5.10 merge window, but it was
reported quite late in the v5.10 process, plus generating and testing
the fix took some time.
The regression is due to commit 36dadef23fcc ("kprobes: Init kprobes
in early_initcall") which on powerpc can use RCU Tasks before
initialization, resulting in boot failures.
The fix is straightforward, simply moving initialization of RCU Tasks
before the early_initcall()s. The fix has been exposed to -next and
kbuild test robot testing, and has been tested by the PowerPC guys"
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu-tasks: Move RCU-tasks initialization to before early_initcall()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Finally allow parallel writes and reads into/from the lockless
ringbuffer. But it is not a complete solution. Readers are still
serialized against each other. And nested writes are still prevented
by printk_safe per-CPU buffers.
- Use ttynull as the ultimate fallback for /dev/console.
- Officially allow disabling console output by using console="" or
console=null
- A few code cleanups
* tag 'printk-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: remove logbuf_lock writer-protection of ringbuffer
printk: inline log_output(),log_store() in vprintk_store()
printk: remove obsolete dead assignment
printk/console: Allow to disable console output by using console="" or console=null
init/console: Use ttynull as a fallback when there is no console
printk: ringbuffer: Reference text_data_ring directly in callees.
|
|
init_on_alloc/free parameters
Patch series "cleanup page poisoning", v3.
I have identified a number of issues and opportunities for cleanup with
CONFIG_PAGE_POISON and friends:
- interaction with init_on_alloc and init_on_free parameters depends on
the order of parameters (Patch 1)
- the boot time enabling uses static key, but inefficienty (Patch 2)
- sanity checking is incompatible with hibernation (Patch 3)
- CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have
init_on_free (Patch 4)
- CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we
have init_on_free (Patch 5)
This patch (of 5):
Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1
produces a warning in dmesg that page_poison takes precedence. However,
as these warnings are printed in early_param handlers for
init_on_alloc/free, they are not printed if page_poison is enabled later
on the command line (handlers are called in the order of their
parameters), or when init_on_alloc/free is always enabled by the
respective config option - before the page_poison early param handler is
called, it is not considered to be enabled. This is inconsistent.
We can remove the dependency on order by making the init_on_* parameters
only set a boolean variable, and postponing the evaluation after all early
params have been processed. Introduce a new
init_mem_debugging_and_hardening() function for that, and move the related
debug_pagealloc processing there as well.
As a result init_mem_debugging_and_hardening() knows always accurately if
init_on_* and/or page_poison options were enabled. Thus we can also
optimize want_init_on_alloc() and want_init_on_free(). We don't need to
check page_poisoning_enabled() there, we can instead not enable the
init_on_* static keys at all, if page poisoning is enabled. This results
in a simpler and more effective code.
Link: https://lkml.kernel.org/r/20201113104033.22907-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20201113104033.22907-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mateusz Nosek <mateusznosek0@gmail.com>
Cc: Laura Abbott <labbott@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the booting phase if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set,
we have following callchain:
start_kernel
...
mm_init
mem_init
memblock_free_all
reset_all_zones_managed_pages
free_low_memory_core_early
...
buffer_init
nr_free_buffer_pages
zone->managed_pages
...
rest_init
kernel_init
kernel_init_freeable
page_alloc_init_late
kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
wait_for_completion(&pgdat_init_all_done_comp);
...
files_maxfiles_init
It's clear that buffer_init depends on zone->managed_pages, but it's reset
in reset_all_zones_managed_pages after that pages are readded into
zone->managed_pages, but when buffer_init runs this process is half done
and most of them will finally be added till deferred_init_memmap done. In
large memory couting of nr_free_buffer_pages drifts too much, also
drifting from kernels to kernels on same hardware.
Fix is simple, it delays buffer_init run till deferred_init_memmap all
done.
But as corrected by this patch, max_buffer_heads becomes very large, the
value is roughly as many as 4 times of totalram_pages, formula:
max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct
buffer_head));
Say in a 64GB memory box we have 16777216 pages, then max_buffer_heads
turns out to be roughly 67,108,864. In common cases, should a buffer_head
be mapped to one page/block(4KB)? So max_buffer_heads never exceeds
totalram_pages. IMO it's likely to make buffer_heads_over_limit bool
value alwasy false, then make codes 'if (buffer_heads_over_limit)' test in
vmscan unnecessary.
So this patch will change the original behavior related to
buffer_heads_over_limit in vmscan since we used a half done value of
zone->managed_pages before, or should we use a smaller factor(<10%) in
previous formula.
akpm: I think this is OK - the max_buffer_heads code is only needed on
highmem machines, to prevent ZONE_NORMAL from being consumed by large
amounts of buffer_heads attached to highmem pagecache. This problem will
not occur on 64-bit machines, so this feature's non-functionality on such
machines is a feature, not a bug.
Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Page owner of pages used by page owner itself used is missing on arm32
targets. The reason is dummy_handle and failure_handle is not initialized
correctly. Buddy allocator is used to initialize these two handles.
However, buddy allocator is not ready when page owner calls it. This
change fixed that by initializing page owner after buddy initialization.
The working flow before and after this change are:
original logic:
1. allocated memory for page_ext(using memblock).
2. invoke the init callback of page_ext_ops like page_owner(using buddy
allocator).
3. initialize buddy.
after this change:
1. allocated memory for page_ext(using memblock).
2. initialize buddy.
3. invoke the init callback of page_ext_ops like page_owner(using buddy
allocator).
with the change, failure/dummy_handle can get its correct value and page
owner output for example has the one for page owner itself:
Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns
PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0()
init_page_owner+0x28/0x2f8
invoke_init_callbacks_flatmem+0x24/0x34
start_kernel+0x33c/0x5d8
Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org
Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
PowerPC testing encountered boot failures due to RCU Tasks not being
fully initialized until core_initcall() time. This commit therefore
initializes RCU Tasks (along with Rude RCU and RCU Tasks Trace) just
before early_initcall() time, thus allowing waiting on RCU Tasks grace
periods from early_initcall() handlers.
Link: https://lore.kernel.org/rcu/87eekfh80a.fsf@dja-thinkpad.axtens.net/
Fixes: 36dadef23fcc ("kprobes: Init kprobes in early_initcall")
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
|
|
Load the size and the checksum fields in the footer as le32
instead of u32. This will allow us to apply bootconfig to the
cross build initrd without caring the endianness.
Link: https://lkml.kernel.org/r/160583934457.547349.10504070298990791074.stgit@devnote2
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
stdin, stdout, and stderr standard I/O stream are created for the init
process. They are not available when there is no console registered
for /dev/console. It might lead to a crash when the init process
tries to use them, see the commit 48021f98130880dd742 ("printk: handle
blank console arguments passed in.").
Normally, ttySX and ttyX consoles are used as a fallback when no consoles
are defined via the command line, device tree, or SPCR. But there
will be no console registered when an invalid console name is configured
or when the configured consoles do not exist on the system.
Users even try to avoid the console intentionally, for example,
by using console="" or console=null. It is used on production
systems where the serial port or terminal are not visible to
users. Pushing messages to these consoles would just unnecessary
slowdown the system.
Make sure that stdin, stdout, stderr, and /dev/console are always
available by a fallback to the existing ttynull driver. It has
been implemented for exactly this purpose but it was used only
when explicitly configured.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20201111135450.11214-2-pmladek@suse.com
|
|
Since Grub may align the size of initrd to 4 if user pass
initrd from cpio, we have to check the preceding 3 bytes as well.
Link: https://lkml.kernel.org/r/160520205132.303174.4876760192433315429.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 85c46b78da58 ("bootconfig: Add bootconfig magic word for indicating bootconfig explicitly")
Reported-by: Chen Yu <yu.chen.surf@gmail.com>
Tested-by: Chen Yu <yu.chen.surf@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull more Kunit updates from Shuah Khan:
- add Kunit to kernel_init() and remove KUnit from init calls entirely.
This addresses the concern that Kunit would not work correctly during
late init phase.
- add a linker section where KUnit can put references to its test
suites.
This is the first step in transitioning to dispatching all KUnit
tests from a centralized executor rather than having each as its own
separate late_initcall.
- add a centralized executor to dispatch tests rather than relying on
late_initcall to schedule each test suite separately. Centralized
execution is for built-in tests only; modules will execute tests when
loaded.
- convert bitfield test to use KUnit framework
- Documentation updates for naming guidelines and how
kunit_test_suite() works.
- add test plan to KUnit TAP format
* tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE
lib: kunit: add bitfield test conversion to KUnit
Documentation: kunit: add a brief blurb about kunit_test_suite
kunit: test: add test plan to KUnit TAP format
init: main: add KUnit to kernel init
kunit: test: create a single centralized executor for all tests
vmlinux.lds.h: add linker section for KUnit test suites
Documentation: kunit: Add naming guidelines
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
"The latest advances in computer science from the trivial queue"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
xtensa: fix Kconfig typo
spelling.txt: Remove some duplicate entries
mtd: rawnand: oxnas: cleanup/simplify code
selftests: vm: add fragment CONFIG_GUP_BENCHMARK
perf: Fix opt help text for --no-bpf-event
HID: logitech-dj: Fix spelling in comment
bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG
MAINTAINERS: rectify MMP SUPPORT after moving cputype.h
scif: Fix spelling of EACCES
printk: fix global comment
lib/bitmap.c: fix spello
fs: Fix missing 'bit' in comment
|
|
Although we have not seen any actual examples where KUnit doesn't work
because it runs in the late init phase of the kernel, it has been a
concern for some time that this could potentially be an issue in the
future. So, remove KUnit from init calls entirely, instead call directly
from kernel_init() so that KUnit runs after late init.
Co-developed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
This eliminates the following sparse warning:
init/main.c:306:6: warning: symbol 'xbc_namebuf' was not declared.
Should it be static?
Link: https://lkml.kernel.org/r/20200915070324.2239473-1-yanaijie@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Since kprobe_event= cmdline option allows user to put kprobes on the
functions in initmem, kprobe has to make such probes gone after boot.
Currently the probes on the init functions in modules will be handled
by module callback, but the kernel init text isn't handled.
Without this, kprobes may access non-exist text area to disable or
remove it.
Link: https://lkml.kernel.org/r/159972810544.428528.1839307531600646955.stgit@devnote2
Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter")
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Fix up one typo: CONFIG_BOOTCONFIG -> CONFIG_BOOT_CONFIG
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- The biggest news in that the tracing ring buffer can now time events
that interrupted other ring buffer events.
Before this change, if an interrupt came in while recording another
event, and that interrupt also had an event, those events would all
have the same time stamp as the event it interrupted.
Now, with the new design, those events will have a unique time stamp
and rightfully display the time for those events that were recorded
while interrupting another event.
- Bootconfig how has an "override" operator that lets the users have a
default config, but then add options to override the default.
- A fix was made to properly filter function graph tracing to the
ftrace PIDs. This came in at the end of the -rc cycle, and needs to
be backported.
- Several clean ups, performance updates, and minor fixes as well.
* tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits)
tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
tracing: Use trace_sched_process_free() instead of exit() for pid tracing
bootconfig: Fix to find the initargs correctly
Documentation: bootconfig: Add bootconfig override operator
tools/bootconfig: Add testcases for value override operator
lib/bootconfig: Add override operator support
kprobes: Remove show_registers() function prototype
tracing/uprobe: Remove dead code in trace_uprobe_register()
kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
ftrace: Fix ftrace_trace_task return value
tracepoint: Use __used attribute definitions from compiler_attributes.h
tracepoint: Mark __tracepoint_string's __used
trace : Have tracing buffer info use kvzalloc instead of kzalloc
tracing: Remove outdated comment in stack handling
ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
ftrace: Setup correct FTRACE_FL_REGS flags for module
tracing/hwlat: Honor the tracing_cpumask
tracing/hwlat: Drop the duplicate assignment in start_kthread()
tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
...
|
|
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
|
|
This patch prepares Software Tag-Based KASAN for stack tagging support.
With stack tagging enabled, KASAN tags stack variable in each function in
its prologue. In start_kernel() stack variables get tagged before KASAN
is enabled via setup_arch()->kasan_init(). As the result the tags for
start_kernel()'s stack variables end up in the temporary shadow memory.
Later when KASAN gets enabled, switched to normal shadow, and starts
checking tags, this leads to false-positive reports, as proper tags are
missing in normal shadow.
Disable KASAN instrumentation for start_kernel(). Also disable it for
arm64's setup_arch() as a precaution (it doesn't have any stack variables
right now).
[andreyknvl@google.com: reorder attributes for start_kernel()]
Link: http://lkml.kernel.org/r/26fb6165a17abcf61222eda5184c030fb6b133d1.1596544734.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Elena Petrova <lenaptr@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: http://lkml.kernel.org/r/55d432671a92e931ab8234b03dc36b14d4c21bfb.1596199677.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
"Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"
* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
init: add an init_dup helper
init: add an init_utimes helper
init: add an init_stat helper
init: add an init_mknod helper
init: add an init_mkdir helper
init: add an init_symlink helper
init: add an init_link helper
init: add an init_eaccess helper
init: add an init_chmod helper
init: add an init_chown helper
init: add an init_chroot helper
init: add an init_chdir helper
init: add an init_rmdir helper
init: add an init_unlink helper
init: add an init_umount helper
init: add an init_mount helper
init: mark create_dev as __init
init: mark console_on_rootfs as __init
init: initialize ramdisk_execute_command at compile time
devtmpfs: refactor devtmpfsd()
...
|
|
Add a simple helper to grab a reference to a file and install it at
the next available fd, and switch the early init code over to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been diggin into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabalize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christophs set_fs changes should security bugs with set_fs much more
difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
...
|
|
Since the parse_args() stops parsing at '--', bootconfig_params()
will never get the '--' as param and initargs_found never be true.
In the result, if we pass some init arguments via the bootconfig,
those are always appended to the kernel command line with '--'
even if the kernel command line already has '--'.
To fix this correctly, check the return value of parse_args()
and set initargs_found true if the return value is not an error
but a valid address.
Link: https://lkml.kernel.org/r/159650953285.270383.14822353843556363851.stgit@devnote2
Fixes: f61872bb58a1 ("bootconfig: Use parse_args() to find bootconfig and '--'")
Cc: stable@vger.kernel.org
Reported-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it. Note that this
theoretically changes behavior as it always is based on the effective
permissions. But during early init that doesn't make a difference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This helper is only used for the early init code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Set ramdisk_execute_command to "/init" at compile time. The command
line can still override it, but this saves a few instructions and
removes a NULL check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Don't rely on the implicit set_fs(KERNEL_DS) for ksys_open to work, but
instead open a struct file for /dev/console and then install it as FD
0/1/2 manually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve. The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.
The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.
The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.
The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances. In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.
Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
In addition to -ftrivial-auto-var-init=pattern (used by
CONFIG_INIT_STACK_ALL now) Clang also supports zero initialization for
locals enabled by -ftrivial-auto-var-init=zero. The future of this flag
is still being debated (see https://bugs.llvm.org/show_bug.cgi?id=45497).
Right now it is guarded by another flag,
-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang,
which means it may not be supported by future Clang releases. Another
possible resolution is that -ftrivial-auto-var-init=zero will persist
(as certain users have already started depending on it), but the name
of the guard flag will change.
In the meantime, zero initialization has proven itself as a good
production mitigation measure against uninitialized locals. Unlike pattern
initialization, which has a higher chance of triggering existing bugs,
zero initialization provides safe defaults for strings, pointers, indexes,
and sizes. On the other hand, pattern initialization remains safer for
return values. Chrome OS and Android are moving to using zero
initialization for production builds.
Performance-wise, the difference between pattern and zero initialization
is usually negligible, although the generated code for zero
initialization is more compact.
This patch renames CONFIG_INIT_STACK_ALL to CONFIG_INIT_STACK_ALL_PATTERN
and introduces another config option, CONFIG_INIT_STACK_ALL_ZERO, that
enables zero initialization for locals if the corresponding flags are
supported by Clang.
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20200616083435.223038-1-glider@google.com
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Merge the state of the locking kcsan branch before the read/write_once()
and the atomics modifications got merged.
Squash the fallout of the rebase on top of the read/write once and atomic
fallback work into the merge. The history of the original branch is
preserved in tag locking-kcsan-2020-06-02.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Patch series "support setting sysctl parameters from kernel command line", v3.
This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1
The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.
I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:
- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.
- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value
- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straighforward conversion possible
- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infractructure, which seems
pointless for a single flag
This patch (of 5):
A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.
Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.
Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.
A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.
Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.
The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.
Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.
[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some init systems (eg. systemd) have init at their own paths, for
example, /usr/lib/systemd/systemd. A compatibility symlink to one of the
hardcoded init paths is provided by another package, usually named
something like systemd-sysvcompat or similar.
Currently distro maintainers who are hands-off on the bootloader are more
or less required to include those compatibility links as part of their
base distribution, because it's hard to migrate away from them since
there's a risk some users will not get the message to set init= on the
kernel command line appropriately.
Moreover, for distributions where the init system is something the
distribution itself is opinionated about (eg. Arch, which has systemd in
the required `base` package), we could usually reasonably configure this
ahead of time when building the distribution kernel. However, we
currently simply don't have any way to configure the kernel to do this.
Here's an example discussion where removing sysvcompat was discussed by
distro maintainers[0].
This patch adds a new Kconfig tunable, CONFIG_DEFAULT_INIT, which if set
is tried before the hardcoded fallback list. So the order of precedence
is now thus:
1. init= on command line (on failure: panic)
2. CONFIG_DEFAULT_INIT (on failure: try #3)
3. Hardcoded fallback list (on failure: panic)
This new config parameter will allow distribution maintainers to move away
from these compatibility links safely, without having to worry that their
users might not have the right init=.
There are also two other benefits of this over having the distribution
maintain a symlink:
1. One of the value propositions over simply having distributions
maintain a /sbin/init symlink via a package is that it also frees
distributions which have a preferred default, but not mandatory, init
system from having their package manager fight with their users for
control of /{s,}bin/init. Instead, the distribution simply makes
their preference known in CONFIG_DEFAULT_INIT, and if the user
installs another init system and uninstalls the default one they can
still make use of /{s,}bin/init and friends for their own uses. This
makes more cases Just Work(tm) without the user having to perform
extra configuration via init=.
2. Since before this we don't know which path the distribution actually
_intends_ to serve init from, we don't pr_err if it is simply
missing, and usually will just silently put the user in a /bin/sh
shell. Now that the distribution can make a declaration of intent, we
can be more vocal when this init system fails to launch for any
reason, even if it's simply because no file exists at that location,
speeding up the palaver of init/mount dependency/etc debugging a bit.
[0]: https://lists.archlinux.org/pipermail/arch-dev-public/2019-January/029435.html
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: http://lkml.kernel.org/r/20200522160234.GA1487022@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().
The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for early boot crashes of kernels built with gcc10 and
stack protector enabled"
* tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix early boot crash on gcc-10, third try
|
|
... or the odyssey of trying to disable the stack protector for the
function which generates the stack canary value.
The whole story started with Sergei reporting a boot crash with a kernel
built with gcc-10:
Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139
Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
Call Trace:
dump_stack
panic
? start_secondary
__stack_chk_fail
start_secondary
secondary_startup_64
-—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary
This happens because gcc-10 tail-call optimizes the last function call
in start_secondary() - cpu_startup_entry() - and thus emits a stack
canary check which fails because the canary value changes after the
boot_init_stack_canary() call.
To fix that, the initial attempt was to mark the one function which
generates the stack canary with:
__attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
however, using the optimize attribute doesn't work cumulatively
as the attribute does not add to but rather replaces previously
supplied optimization options - roughly all -fxxx options.
The key one among them being -fno-omit-frame-pointer and thus leading to
not present frame pointer - frame pointer which the kernel needs.
The next attempt to prevent compilers from tail-call optimizing
the last function call cpu_startup_entry(), shy of carving out
start_secondary() into a separate compilation unit and building it with
-fno-stack-protector, was to add an empty asm("").
This current solution was short and sweet, and reportedly, is supported
by both compilers but we didn't get very far this time: future (LTO?)
optimization passes could potentially eliminate this, which leads us
to the third attempt: having an actual memory barrier there which the
compiler cannot ignore or move around etc.
That should hold for a long time, but hey we said that about the other
two solutions too so...
Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
|
|
Commit de462e5f1071 ("bootconfig: Fix to remove bootconfig
data from initrd while boot") causes a cosmetic regression
on dmesg, which warns "no bootconfig data" message without
bootconfig cmdline option.
Fix setup_boot_config() by moving no bootconfig check after
commandline option check.
Link: http://lkml.kernel.org/r/9b1ba335-071d-c983-89a4-2677b522dcc8@molgen.mpg.de
Link: http://lkml.kernel.org/r/158916116468.21787.14558782332170588206.stgit@devnote2
Fixes: de462e5f1071 ("bootconfig: Fix to remove bootconfig data from initrd while boot")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
If there is a bootconfig data in the tail of initrd/initramfs,
initrd image sanity check caused an error while decompression
stage as follows.
[ 0.883882] Unpacking initramfs...
[ 2.696429] Initramfs unpacking failed: invalid magic at start of compressed archive
This error will be ignored if CONFIG_BLK_DEV_RAM=n,
but CONFIG_BLK_DEV_RAM=y the kernel failed to mount rootfs
and causes a panic.
To fix this issue, shrink down the initrd_end for removing
tailing bootconfig data while boot the kernel.
Link: http://lkml.kernel.org/r/158788401014.24243.17424755854115077915.stgit@devnote2
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 7684b8582c24 ("bootconfig: Load boot config from the tail of initrd")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Resolve these conflicts:
arch/x86/Kconfig
arch/x86/kernel/Makefile
Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines
in the Kconfig to reduce the probability of future conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
printk_deferred(), similarly to printk_safe/printk_nmi, does not
immediately attempt to print a new message on the consoles, avoiding
calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
which potentially can deadlock the system.
Those printk() flavors, instead, rely on per-CPU flush irq_work to print
messages from safer contexts. For same reasons (recursive scheduler or
timekeeping calls) printk() uses per-CPU irq_work in order to wake up
user space syslog/kmsg readers.
However, only printk_safe/printk_nmi do make sure that per-CPU areas
have been initialised and that it's safe to modify per-CPU irq_work.
This means that, for instance, should printk_deferred() be invoked "too
early", that is before per-CPU areas are initialised, printk_deferred()
will perform illegal per-CPU access.
Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random:
silence a lockdep splat with printk()") user-space syslog/kmsg readers
are not able to read new kernel messages.
The reason is printk_deferred() being called too early (as was pointed
out by Petr and John).
Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
areas are initialized.
Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
Reported-by: Lech Perczak <l.perczak@camlintechnologies.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Jann Horn <jannh@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Conflicts:
arch/x86/purgatory/Makefile
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Show line and column when we got a parse error in bootconfig tool.
Current lib/bootconfig shows the parse error with byte offset, but
that is not human readable.
This makes xbc_init() not showing error message itself but able to
pass the error message and position to caller, so that the caller
can decode it and show the error message with line number and columns.
With this patch, bootconfig tool shows an error with line:column as
below.
$ cat samples/bad-dotword.bconf
# do not start keyword with .
key {
.word = 1
}
$ ./bootconfig -a samples/bad-dotword.bconf initrd
Parse Error: Invalid keyword at 3:3
Link: http://lkml.kernel.org/r/158323469002.10560.4023923847704522760.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing and bootconfig updates:
"Fixes and changes to bootconfig before it goes live in a release.
Change in API of bootconfig (before it comes live in a release):
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure
Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug
Fix of trace_printk() breaking postponed start up tests
Make a function static that is only used in a single file"
* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
bootconfig: Add append value operator support
bootconfig: Prohibit re-defining value on same key
bootconfig: Print array as multiple commands for legacy command line
bootconfig: Reject subkey and value on same parent key
tools/bootconfig: Remove unneeded error message silencer
bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
bootconfig: Set CONFIG_BOOT_CONFIG=n by default
tracing: Clear trace_state when starting trace
bootconfig: Mark boot_config_checksum() static
tracing: Disable trace_printk() on post poned tests
tracing: Have synthetic event test use raw_smp_processor_id()
tracing: Fix number printing bug in print_synth_event()
tracing: Check that number of vals matches number of synth event fields
tracing: Make synth_event trace functions endian-correct
tracing: Make sure synth_event_trace() example always uses u64
|
|
Print arraied values as multiple same options for legacy
kernel command line. With this rule, if the "kernel.*" and
"init.*" array entries in bootconfig are printed out as
multiple same options, e.g.
kernel {
console = "ttyS0,115200"
console += "tty0"
}
will be correctly converted to
console="ttyS0,115200" console="tty0"
in the kernel command line.
Link: http://lkml.kernel.org/r/158220118213.26565.8163300497009463916.stgit@devnote2
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Add bootconfig magic word to the end of bootconfig on initrd
image for indicating explicitly the bootconfig is there.
Also tools/bootconfig treats wrong size or wrong checksum or
parse error as an error, because if there is a bootconfig magic
word, there must be a bootconfig.
The bootconfig magic word is "#BOOTCONFIG\n", 12 bytes word.
Thus the block image of the initrd file with bootconfig is
as follows.
[Initrd][bootconfig][size][csum][#BOOTCONFIG\n]
Link: http://lkml.kernel.org/r/158220112263.26565.3944814205960612841.stgit@devnote2
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Set CONFIG_BOOT_CONFIG=n by default. This also warns
user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given
in the kernel command line.
Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
In fact, this function is only used in this file, so mark it with 'static'.
Link: http://lkml.kernel.org/r/1581852511-14163-1-git-send-email-hqjagain@gmail.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes:
- Fix an uninitialized variable
- Fix compile bug to bootconfig userspace tool (in tools directory)
- Suppress some error messages of bootconfig userspace tool
- Remove unneded CONFIG_LIBXBC from bootconfig
- Allocate bootconfig xbc_nodes dynamically. To ease complaints about
taking up static memory at boot up
- Use of parse_args() to parse bootconfig instead of strstr() usage
Prevents issues of double quotes containing the interested string
- Fix missing ring_buffer_nest_end() on synthetic event error path
- Return zero not -EINVAL on soft disabled synthetic event (soft
disabling must be the same as hard disabling, which returns zero)
- Consolidate synthetic event code (remove duplicate code)"
* tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Consolidate trace() functions
tracing: Don't return -EINVAL when tracing soft disabled synth events
tracing: Add missing nest end to synth_event_trace_start() error case
tools/bootconfig: Suppress non-error messages
bootconfig: Allocate xbc_nodes array dynamically
bootconfig: Use parse_args() to find bootconfig and '--'
tracing/kprobe: Fix uninitialized variable bug
bootconfig: Remove unneeded CONFIG_LIBXBC
tools/bootconfig: Fix wrong __VA_ARGS__ usage
|
|
The current implementation does a naive search of "bootconfig" on the kernel
command line. But this could find "bootconfig" that is part of another
option in quotes (although highly unlikely). But it also needs to find '--'
on the kernel command line to know if it should append a '--' or not when a
bootconfig in the initrd file has an "init" section. The check uses the
naive strstr() to find to see if it exists. But this can return a false
positive if it exists in an option and then the "init" section in the initrd
will not be appended properly.
Using parse_args() to find both of these will solve both of these problems.
Link: https://lore.kernel.org/r/202002070954.C18E7F58B@keescook
Fixes: 7495e0926fdf3 ("bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline")
Fixes: 1319916209ce8 ("bootconfig: init: Allow admin to use bootconfig for init command line")
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added new "bootconfig".
This looks for a file appended to initrd to add boot config options,
and has been discussed thoroughly at Linux Plumbers.
Very useful for adding kprobes at bootup.
Only enabled if "bootconfig" is on the real kernel command line.
- Created dynamic event creation.
Merges common code between creating synthetic events and kprobe
events.
- Rename perf "ring_buffer" structure to "perf_buffer"
- Rename ftrace "ring_buffer" structure to "trace_buffer"
Had to rename existing "trace_buffer" to "array_buffer"
- Allow trace_printk() to work withing (some) tracing code.
- Sort of tracing configs to be a little better organized
- Fixed bug where ftrace_graph hash was not being protected properly
- Various other small fixes and clean ups
* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
bootconfig: Show the number of nodes on boot message
tools/bootconfig: Show the number of bootconfig nodes
bootconfig: Add more parse error messages
bootconfig: Use bootconfig instead of boot config
ftrace: Protect ftrace_graph_hash with ftrace_sync
ftrace: Add comment to why rcu_dereference_sched() is open coded
tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
tracing: Annotate ftrace_graph_hash pointer with __rcu
bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
tracing: Use seq_buf for building dynevent_cmd string
tracing: Remove useless code in dynevent_arg_pair_add()
tracing: Remove check_arg() callbacks from dynevent args
tracing: Consolidate some synth_event_trace code
tracing: Fix now invalid var_ref_vals assumption in trace action
tracing: Change trace_boot to use synth_event interface
tracing: Move tracing selftests to bottom of menu
tracing: Move mmio tracer config up with the other tracers
tracing: Move tracing test module configs together
tracing: Move all function tracing configs together
tracing: Documentation for in-kernel synthetic event API
...
|
|
Show the number of bootconfig nodes on boot message.
Link: http://lkml.kernel.org/r/158091062297.27924.9051634676068550285.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Use "bootconfig" (1 word) instead of "boot config" (2 words)
in the boot message.
Link: http://lkml.kernel.org/r/158091059459.27924.14414336187441539879.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
As the bootconfig is appended to the initrd it is not as easy to modify as
the kernel command line. If there's some issue with the kernel, and the
developer wants to boot a pristine kernel, it should not be needed to modify
the initrd to remove the bootconfig for a single boot.
As bootconfig is silently added (if the admin does not know where to look
they may not know it's being loaded). It should be explicitly added to the
kernel cmdline. The loading of the bootconfig is only done if "bootconfig"
is on the kernel command line. This will let admins know that the kernel
command line is extended.
Note, after adding printk()s for when the size is too great or the checksum
is wrong, exposed that the current method always looked for the boot config,
and if this size and checksum matched, it would parse it (as if either is
wrong a printk has been added to show this). It's better to only check this
if the boot config is asked to be looked for.
Link: https://lore.kernel.org/r/CAHk-=wjfjO+h6bQzrTf=YCZA53Y3EDyAs3Z4gEsT7icA3u_Psw@mail.gmail.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|